MULTIPLE INSTANCE FUZZY INFERENCE

ABSTRACT 

A  novel  fuzzy  learning  framework  that  employs  fuzzy  inference  to  solve  the  problem 
of  multiple  instance  learning  (MIL)  is  presented.  The  framework  introduces  a  new  class  of 
fuzzy  inference  systems  called  Multiple  Instance  Fuzzy  Inference  Systems  (MI-FIS). 

Fuzzy inference is a powerful modeling framework that can handle computing with knowledge
uncertainty and measurement imprecision effectively. Fuzzy inference performs a nonlinear
mapping from an input space to an output space by deriving conclusions from a set
of fuzzy if-then rules and known facts. Rules can be identified from expert knowledge or
learned from data.

In multiple instance problems, the training data is ambiguously labeled. Instances are
grouped into bags; the labels of bags are known, but not those of individual instances. MIL
deals with learning a classifier at the bag level. Over the years, many solutions to this problem
have been proposed. However, no MIL formulation employing fuzzy inference exists in
the literature.

In this dissertation, we introduce multiple instance fuzzy logic that enables fuzzy reasoning
with bags of instances. Accordingly, different multiple instance fuzzy inference styles are
proposed. The Multiple Instance Mamdani style fuzzy inference (MI-Mamdani) extends
the standard Mamdani style inference to compute with multiple instances. The Multiple
Instance Sugeno style fuzzy inference (MI-Sugeno) is an extension of the standard Sugeno
style inference to handle reasoning with multiple instances.

In  addition  to  the  MI-FIS  inference  styles,  one  of  the  main  contributions  of  this  work  is  an 
adaptive  neuro-fuzzy  architecture  designed  to  handle  bags  of  instances  as  input  and  capable 
of  learning  from  ambiguously  labeled  data.  The  proposed  architecture,  called  Multiple 
Instance-ANFIS  (MI-ANFIS),  extends  the  standard  Adaptive  Neuro  Fuzzy  Inference  System 
(ANFIS). 

We also propose different methods to identify and learn fuzzy if-then rules in the context
of  MIL.  In  particular,  a  novel  learning  algorithm  for  MI-ANFIS  is  derived.  The  learning  is 
achieved  by  using  the  backpropagation  algorithm  to  identify  the  premise  parameters  and 
consequent  parameters  of  the  network. 

The  proposed  framework  is  tested  and  validated  using  synthetic  and  benchmark  datasets 
suitable  for  MIL  problems.  Additionally,  we  apply  the  proposed  Multiple  Instance  Inference 
to  the  problem  of  region-based  image  categorization  as  well  as  to  fuse  the  output  of  multiple 
discrimination  algorithms  for  the  purpose  of  landmine  detection  using  Ground  Penetrating 
Radar. 

Amine Ben Khalifa

December 2, 2015

CHAPTER 1


INTRODUCTION 

Fuzzy inference is a powerful modeling framework that can handle computing with
knowledge uncertainty and measurement imprecision effectively [2]. It is a process based
on the concepts of fuzzy set theory and fuzzy reasoning. It performs a non-linear mapping
from an input space to an output space by deriving conclusions from a set of fuzzy if-then
rules and known facts [3]. Fuzzy inference has been successfully applied to a wide range
of problems, mainly in system modeling and control [4-14]. Most of the proposed fuzzy
inference methods gained success because of their ability to leverage expert knowledge to
identify the model parameters [15]. This practice simplifies system design and ensures that
the knowledge base (if-then rules) used by the system is easy to interpret [16].

More recently, fuzzy inference has increasingly been applied to more advanced
applications, such as content-based information retrieval [17], image segmentation [18], image
annotation [19], pattern recognition [20], recommender systems [21,22], and multiple
classifier fusion [23]. These applications are more challenging because they require an
extensive knowledge base to accommodate various scenarios. Since this diverse knowledge
base cannot be fully provided by domain experts, data-driven techniques are typically
used to identify and learn the fuzzy inference system’s parameters [24,25]. In this latter
approach, supervised and unsupervised learning algorithms are devised to learn the
parameters of the fuzzy inference system (i.e., learn the knowledge base) from a set of labeled
training data. For instance, a clustering algorithm (unsupervised learning) can be used to
identify local contexts of the input space, and a linear classifier (supervised learning) can
be used to learn decisions within each of the contexts. These methods thus replace the
traditional system identification methods based on expert knowledge with more scalable,
adaptive, and broader learning methods.

Typically, in supervised learning problems, access to large labeled training datasets
improves the performance of the devised algorithms by increasing their robustness and
generalization capabilities. Nowadays, access to such large datasets is becoming more
convenient. In fact, we generate about 2.5 quintillion bytes of data every day¹ [26,27]. This
data is continuously collected from sensors that measure environmental information, posts
to social media sites such as Flickr [28], digital pictures and videos uploaded to
advertisement websites such as Craigslist [29], etc. This trend is not expected to slow down anytime
soon and is fueled by the drastic decrease in the cost of data storage [30]. However, for a
supervised learning method to benefit from this data, the data needs to be carefully preprocessed,
filtered, and labeled. Unfortunately, this process can be too tedious, as a vast portion of
the collected data is unstructured, with only a few tags that label the object at a high level (e.g.,
social media images). To overcome this lack of labeled data, many recent developments use
crowdsourcing services such as Amazon Mechanical Turk [31] to hire an on-demand human
workforce over the internet to assign labels to data points. For instance, a tool named
“LabelMe” by MIT [32] could be used for this purpose. Similarly, Google started using its
CAPTCHA service, reCAPTCHA [33], to label address digits collected from Street View images
for the purpose of training a deep neural network [34]. Despite the scalability of many
recent machine learning algorithms, they still require the fully engaged cognition of a human
being to assign labels at a finer level (e.g., labeling regions within images). Unfortunately, this
process is ambiguous, subjective, and prone to errors (e.g., the difficulty of selecting an object of
interest within an image).

To summarize, large amounts of data are available and could be used for learning.
However, this data is typically labeled ambiguously and at a coarse level. In fact, labels,
or tags, tend to be associated with collections of samples rather than single samples. For
example, in image annotation, tags could be used as indicators of the existence of objects
of interest within the images (sky, sea, beach, ...). However, the exact location of those
objects is not available and is too tedious to extract for large collections of images. An
alternative and relatively new learning framework that tackles this inherent ambiguity
better than standard supervised learning is the Multiple Instance Learning (MIL) paradigm [35].

¹ 90% of the data in the world today has been created in the last two years alone.

1.1  Multiple  Instance  Learning 

Unlike  standard  supervised  learning,  in  MIL,  an  object  is  not  represented  by  a  simple 
data  point,  but  rather  by  a  collection  of  instances,  called  a  bag.  Each  bag  can  contain  a 
different  number  of  instances.  In  MIL,  a  bag  is  labeled  negative  if  all  of  its  instances 
are  negative,  and  positive  if  at  least  one  of  its  instances  is  positive  (positive  bags  may 
also  contain  negative  instances).  Positive  bags  can  encode  ambiguity  since  the  instances 
themselves  are  not  labeled.  Given  a  training  set  of  labeled  bags,  the  goal  of  MIL  is  to  learn 
a  concept  that  predicts  the  labels  of  training  data  at  the  instance  level  and  generalizes  to 
predict  the  labels  of  testing  bags  and  their  instances  [36]. 
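
The bag-labeling rule described above can be sketched in a few lines of Python. This is only an illustration of the standard MIL assumption; the function name `bag_label` and the example bags are hypothetical, and in a real MIL problem the instance labels passed in here are exactly what the learner does not get to see.

```python
from typing import Sequence


def bag_label(instance_labels: Sequence[bool]) -> bool:
    """Standard MIL assumption: a bag is positive if and only if
    at least one of its instances is positive."""
    return any(instance_labels)


# Hypothetical bags of patch labels (hidden from the learner in practice).
sky_image = [False, True, False, False]  # one "sky" patch among background patches
indoor_image = [False, False, False]     # no "sky" patch at all

print(bag_label(sky_image))     # True
print(bag_label(indoor_image))  # False
```

A positive label therefore says only that some instance is positive, not which one, which is the source of the ambiguity discussed in this section.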

The MIL problem was first formalized by Dietterich et al. [37], providing a solution
to drug activity prediction. Ever since, it has increasingly been applied to a wide
variety of tasks. Some of the applications include content-based information retrieval [38],
drug discovery [39], pattern recognition [40], image classification [41], text classification [42],
region-based image categorization [43], image annotation [44], object tracking [45] and
time-series prediction [35], to name a few. To illustrate the need for MIL, in the following we
analyse how a multiple instance (MI) representation can be applied to image classification.

Consider the simple example of classifying images that contain “sky”. In this
problem, for an input image we want to determine whether a region that contains sky is present
in the image. Using an MIL approach, each training image is represented by a bag of
instances where each instance corresponds to a segmented region of interest. These regions
could be obtained by dividing images into patches. A multiple instance representation is
well suited for this purpose because only a few regions may contain the object of interest (sky),
that is, the positive class. Other patches will be from the background or other classes. This
representation is illustrated in Figure 1.1. Traditional single instance learning methods
are based on instance level (patch-level) labels and would require the image to be correctly
segmented and labeled prior to learning.


Bag: image = {instances: patches}

Figure 1.1: Example of an image represented as a bag of 12 instances. Each instance could
be a feature vector extracted from one patch. The bag is labeled “sky” because at least one
of its instances is sky. However, many other instances are not “sky”. Labels at the instance
level are not available.


1.2  Fuzzy  Logic  and  Fuzzy  Inference  Systems 

Fuzzy logic [4] is a computational framework that makes use of fuzzy set theory and
fuzzy assignment of elements to sets. In classical set theory, whose sets are also known as
crisp sets, an element is either a member of a set or not. In fuzzy set theory, by contrast, an
element is characterized by a degree of membership, usually a real number between 0 and 1. Fuzzy
logic, in contrast to traditional two-valued (Boolean) logic, uses the elements’ membership
degrees to evaluate the degree of truth of logical propositions. Hence, the degree of truth
is non-crisp, or soft. This enables fuzzy logic to be characterized by linguistic terms rather
than by numbers. For example, in fuzzy logic, a fuzzy proposition can have the following
expression: “patch is blue”, in which the linguistic term “blue” is a fuzzy set that describes
color intensity. Fuzzy logic simulates the imprecise human understanding of the world, and can
be viewed as a framework for computing with words [46].
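
To make the contrast between crisp and fuzzy membership concrete, the following minimal sketch evaluates the proposition “patch is blue” both ways. Everything specific here is an assumption made for illustration: the input is taken to be a hue angle in degrees, the crisp interval and the triangular shape of the fuzzy set are arbitrary choices, and the function names are hypothetical.

```python
def crisp_is_blue(hue: float) -> bool:
    """Crisp set: a hue either belongs to 'blue' or it does not."""
    return 200.0 <= hue <= 260.0


def triangular(x: float, left: float, peak: float, right: float) -> float:
    """Triangular membership function: 0 outside (left, right), 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def fuzzy_is_blue(hue: float) -> float:
    """Fuzzy set 'blue': degree of truth of 'patch is blue' in [0, 1]."""
    return triangular(hue, 180.0, 230.0, 280.0)


for hue in (190.0, 205.0, 230.0):
    print(hue, crisp_is_blue(hue), fuzzy_is_blue(hue))
```

A hue of 190 is rejected outright by the crisp set but still receives a small positive degree of membership in the fuzzy set, which is the soft notion of truth described above.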

A Fuzzy Inference System (FIS) is a paradigm in soft computing which provides
a means of approximate reasoning [47]. A FIS is capable of handling computing with
knowledge uncertainty and measurement imprecision effectively [2]. It performs a non-linear
mapping from an input space to an output space by deriving conclusions from a
set of fuzzy if-then rules and known facts. Fuzzy rules are condition/action (if-then) rules
composed of a set of linguistic variables (e.g., patch) which can each take on linguistic terms
(e.g., red, green, blue). For example, the following rules could be used to identify patches
from the image in Figure 1.1:

•  If  patch  is  blue  then  region  is  sky. 

•  If  patch  is  blue  and  patch  position  is  upper  half  then  region  is  sky. 

•  If  patch  is  yellow  and  patch  position  is  lower  half  then  region  is  beach. 

Typically, a FIS is composed of five components: (1) a Fuzzification unit that
assigns a membership degree in the input fuzzy sets to each crisp input dimension; (2) a
Knowledge Base characterized by the fuzzy sets of the linguistic terms; (3) a Rule Base
containing a set of fuzzy if-then rules; (4) an Inference unit that performs fuzzy
reasoning; and (5) a Defuzzification unit that generates crisp output values. Mamdani
[48] and Sugeno [49] are the two commonly used fuzzy inference systems. A graphical
representation of a generic FIS is shown in Figure 1.2.
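
As a rough illustration of how the five components fit together, the sketch below wires them into a tiny single-instance FIS that scores one patch for “sky”. It is not the system developed in this dissertation. The membership shapes, the extra rules, the use of `min` for “and”, and the zero-order Sugeno-style weighted average used for defuzzification are all assumptions chosen to keep the example short.

```python
def ramp_up(x: float, lo: float, hi: float) -> float:
    """Membership rising linearly from 0 at lo to 1 at hi (clipped to [0, 1])."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


# Knowledge base + fuzzification: map crisp inputs (both assumed to lie in [0, 1],
# with height = 1 at the top of the image) to membership degrees.
def fuzzify(blueness: float, height: float) -> dict:
    blue = ramp_up(blueness, 0.3, 0.7)
    upper = ramp_up(height, 0.4, 0.6)
    return {
        "blue": blue,
        "not_blue": 1.0 - blue,
        "upper_half": upper,
        "lower_half": 1.0 - upper,
    }


# Rule base: (terms joined by "and", crisp consequent = "sky" confidence).
RULES = [
    (("blue", "upper_half"), 1.0),  # If patch is blue and position is upper half then sky.
    (("not_blue",), 0.0),           # Hypothetical rule: non-blue patches are not sky.
    (("lower_half",), 0.0),         # Hypothetical rule: lower-half patches are not sky.
]


def sky_confidence(blueness: float, height: float) -> float:
    degrees = fuzzify(blueness, height)
    # Inference: firing strength of each rule, using min as the fuzzy "and".
    strengths = [min(degrees[term] for term in terms) for terms, _ in RULES]
    total = sum(strengths)
    if total == 0.0:
        return 0.0
    # Defuzzification: weighted average of the rule consequents.
    return sum(w * out for w, (_, out) in zip(strengths, RULES)) / total


print(sky_confidence(0.9, 0.9))  # bright blue patch near the top of the image
print(sky_confidence(0.1, 0.2))  # non-blue patch near the bottom of the image
```

Note that the input here is a single patch, which is the single-instance setting that the rest of this chapter sets out to generalize.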


Figure  1.2:  A  graphical  representation  of  a  FIS  and  its  components. 
1.3 Motivations and Contributions


1.3.1  Motivations 

Consider the example of sky image classification presented earlier. Let us suppose
we have assembled a training dataset with images labeled as positive if they contain sky, and
negative otherwise. Clearly, the data is ambiguously labeled (i.e., labels are available only
at the bag level, and individual patches are not labeled). We want to train a FIS capable of
recognizing images containing sky (i.e., one that produces a high output when the test image
contains a sky region). To do so, the FIS needs to learn rules capable of describing the concept of
sky. To human perception, the sky concept can be described as a blue region that is
located in the upper half of the frame. Thus, one possibility is to extract two features: Color
Intensity and Vertical Position of a region. These features need to be extracted locally
at the patch level, which turns the problem into a multiple instance problem: an image
is a bag of instances, with a label available only at the bag level. On the other hand, extracting one
global feature vector covering the whole image would lead to confusion and would not
describe concepts effectively, because the features would describe non-homogeneous regions
and would be based on averages. Because of the uncertainty and subjectivity of describing
the color of a patch and its vertical position within the image, the two features are better
represented as fuzzy sets. The Color Intensity feature can be described by means of three linguistic
terms: Red, Green, and Blue, while Vertical Position can be described with the linguistic
terms Upper Half, Middle, and Lower Half. Figure 1.3 shows a graphical representation of the
membership degrees of the two features in the different linguistic terms (fuzzy sets).

Figure 1.3: Linguistic terms of the Color Intensity and Vertical Position features. (a) A
graphical representation of 3 fuzzy sets describing the Color Intensity feature. (b) A
graphical representation of 3 fuzzy sets describing the Vertical Position feature.

Clearly, this representation is close to the way humans perceive the patches of the
image in Figure 1.1. Due to the absence of labels at the patch level, and therefore the absence
of feedback, FIS training cannot be achieved. Nonetheless, in this particular example
the concept of sky can be considered trivial. Thus, the FIS can be designed by
leveraging expert knowledge, which can lead to using the following rule for the purpose of patch
classification:

•  If patch is blue and patch position is upper half then region is sky.

To classify the image correctly, the results of the patch classifications need to be aggregated
to produce a final output.

There  are  two  major  limitations  that  prevent  using  standard  FIS  methods  with 
multiple  instance  data.  First,  due  to  the  absence  of  labels  at  the  instance  level,  we  cannot 
use  standard  FIS  learning  methods  to  construct  the  knowledge  base.  Second,  we  need  an 
effective  mechanism  to  aggregate  instances’  confidences  and  infer  at  the  bag  level. 

The  limitations  are  due  mainly  to  the  inherent  architecture  of  fuzzy  inference  systems.  The 
generic  inference  system  shown  in  Figure  1.2  reasons  with  individual  instances.  First,  the 
system’s  input  is  an  individual  instance.  Second,  the  rules  describe  fuzzy  regions  within 
the instance space. Third, the output of the system corresponds to the fuzzy inference
using  a  single  instance.  Fourth,  labels  of  the  individual  instances  are  required  when  using 
learning  techniques  to  identify  the  parameters  of  the  system.  In  summary,  traditional  fuzzy 
inference  systems  cannot  be  used  effectively  within  the  MIL  framework. 

To address the above limitations, we propose to generalize fuzzy inference so that
it can reason with bags of instances.

1.3.2  Contributions 

In  this  dissertation,  we  propose  developing  a  Multiple  Instance  Fuzzy  Inference 
framework.  In  particular,  we  propose: 
1.  Developing Multiple Instance Fuzzy Logic that generalizes traditional fuzzy logic to
compute with bags of instances. Under this work, we propose multiple instance
generalizations of fuzzy propositions, fuzzy if-then rules, fuzzy implication, and fuzzy
reasoning.

2.  Extending Mamdani and Sugeno fuzzy inference systems to reason with bags instead
of individual instances using the developed Multiple Instance Fuzzy Logic. We call
the new inference systems Multiple Instance-Mamdani (MI-Mamdani) and Multiple
Instance-Sugeno (MI-Sugeno).

3.  Developing methods to identify and learn multiple instance fuzzy if-then rules from
ambiguously labeled data.

4.  Extending the standard Adaptive Neuro-Fuzzy Inference System (ANFIS) [50] to
reason with bags of instances as input and to learn from ambiguously labeled data. We
call the new neuro-fuzzy architecture Multiple Instance-ANFIS (MI-ANFIS).

5.  Developing a learning algorithm to learn the parameters of the proposed MI-ANFIS
neuro-fuzzy inference system.

The remainder of this dissertation is organized as follows. Chapter 2 provides a
review of multiple instance learning, fuzzy logic, and common fuzzy inference systems.
Chapter 3 introduces our proposed multiple instance fuzzy logic framework. Chapter 4
introduces our proposed MI-Mamdani and MI-Sugeno inference systems. Chapter 5
introduces our proposed MI-ANFIS neuro-fuzzy architecture. Chapter 6 provides experimental
results and analysis of the proposed methods. Finally, Chapter 7 provides conclusions and

CHAPTER 2


BACKGROUND 

In  this  chapter,  we  provide  background  material  that  is  relevant  to  our  research.  We 
start  with  a  review  of  the  Multiple  Instance  Learning  problem  and  give  brief  examples  to 
motivate  the  need  for  this  learning  paradigm.  Next,  we  provide  an  overview  of  fuzzy  logic. 
Finally,  we  provide  an  overview  of  common  fuzzy  inference  systems. 

2.1  Multiple  Instance  Learning 

Multiple  Instance  Learning  (MIL)  is  a  supervised  learning  paradigm  that  aims  at 
solving  classification  and  regression  problems  by  devising  algorithms  capable  of  learning 
from  ambiguously  labeled  data  [51].  In  standard  supervised  learning,  each  example  is 
represented  by  a  fixed-length  vector  of  features.  In  MIL,  an  example  is  a  collection  of  feature 
vectors  (instances),  called  a  bag.  Each  bag  can  contain  a  different  number  of  instances. 
Labels  of  bags  are  known  but  not  those  of  individual  instances.  A  bag  is  labeled  negative 
if  all  of  its  instances  are  negative,  and  positive  if  at  least  one  of  its  instances  is  positive. 
Positive  bags  can  encode  ambiguity  since  the  instances  themselves  are  not  labeled.  Given 
a  training  set  of  labeled  bags,  the  goal  of  MIL  is  to  learn  a  concept  that  predicts  the  labels 
of  training  data  and  generalizes  to  predict  the  labels  of  testing  bags  [36].  The  difference 
between  standard  supervised  learning  and  multi-instance  learning  is  illustrated  in  Figure 
2.1. 

The problem of MIL arises naturally in many scenarios. It was first applied by
Dietterich et al. to provide a solution to drug activity prediction [37]. Ever since, it has
increasingly been applied to a wide variety of tasks such as content-based information
retrieval [38], pattern recognition [40], image classification [41], text classification [42],
region-based image categorization [43], image annotation [44], object tracking [45] and time-series
prediction [35], to name a few. MIL has a broader domain of application beyond those
few examples. Maron et al. [35] presented a methodology to transform difficult learning
problems into multiple instance learning problems.

Figure 2.1: Difference between standard supervised learning and multiple instance learning.¹
(a) The standard supervised learning paradigm. (b) The multiple instance learning paradigm,
in which an object is represented by instances 1 through M.

In general, MIL can be applied in two contexts of ambiguity: “polymorphism
ambiguity” and “part-whole ambiguity” [52]. In polymorphism ambiguity, an object can have
multiple forms of expression in the input space and it is not known which form is responsible
for the object label. In part-whole ambiguity, by contrast, an object can be broken into several
parts represented by different feature vectors in the input space, but only a few parts
are responsible for the object label [53]. In the following we briefly describe two application
domains related to the two distinct ambiguity concepts.

1.  Polymorphism Ambiguity: This type of ambiguity arises more often in applications
related to chemistry and bioscience. The original MIL application of drug discovery [35, 36]
is a case of polymorphism ambiguity. In this type of application, the goal is typically to
classify molecules by looking at their shapes. Each molecule can appear in several
distinct shapes because of binding and twisting that might occur during interactions.
Thus, a molecule can have different forms of expression. However, it is a tedious
process to identify which form is responsible for the molecule’s behaviour (label). Hence,
the problem is better represented as a multiple instance problem. A more recent
application that presents polymorphism ambiguity is genomic data analysis [54]. In
this type of application, a gene is represented by multiple isoforms, and the goal is to
predict the gene-level function. Typically, this problem is a multiple instance problem.

2.  Part-whole Ambiguity: This type of ambiguity is more common in pattern
recognition problems. For example, for an image annotation application such as the one presented in
Section 1.1, features are usually extracted locally (from patches) with labels, or tags,
available only at the image level, making the problem a multiple instance problem.
Another closely related application is object detection. In this application, objects of
interest cover only a limited region of the image; the rest could be other objects or
background. For the task of training a classifier to detect the object, tedious human labor
is traditionally required to extract patches containing the object and label
them. As indicated by Viola et al. [55], placing bounding boxes around objects is an
inherently ambiguous task. Thus, it is more convenient to solve the problem of object
detection using the MIL paradigm, which in turn encodes ambiguity effectively. The
part-whole ambiguity also arises in other applications such as computer audition [56]
and text document classification [57]. These applications are similar to object
detection: features are extracted from audio segments or text paragraphs, and labels are
only available at the audio clip level or text document level, respectively.

¹ Figure based on [36].

We  now  review  some  of  the  common  algorithms  that  have  been  proposed  to  solve  the 
multiple  instance  problem  and  are  related  to  our  research. 

2.1.1  Diverse  Density 

The  most  commonly  referenced  MIL  algorithm  found  in  the  literature  is  Diverse 
Density  (DD).  It  was  first  introduced  by  Maron  et  al.  [39].  The  objective  of  DD  is  to  find  a 
“soft”  set  that  describes  the  intersection  of  the  positive  bags  minus  the  union  of  the  negative 
bags.  To  achieve  this,  DD  attempts  to  find  a  concept  in  the  feature  space  (instance  space) 
that  is  close  to  at  least  one  instance  from  every  positive  bag  but  far  away  from  instances  in 
the negative bags.

If we define diverse density as a measure of how many different positive bags have instances
near a given point of the input space, and how far the negative instances are from that
point, then a concept is defined by Maron et al. [39] as a point with maximum diverse
density.

Formally,  if  the  training  data  is  presented  as  positive  bags,  denoted  B±,  , . . . ,  B+, 

and  negative  bags,  denoted  ,  ■  ■  ■ ,  B~x,  the  diverse  density  of  a  given  concept  t  is 

defined  as  the  probability  that  t  is  the  correct  concept. 

DD(t)  =  Pr(t  |  B+,  B+, . . . ,  B+,  Sf,  B2"  ■ . . . ,  B~).  (2.1) 

Using Bayes’ rule and under the assumption that all bags are conditionally independent
given the true target concept, it was shown that (2.1) can be decomposed into:

$$DD(t) = \prod_{1 \le i \le n} \Pr(B_i^+ \mid t) \prod_{1 \le i \le m} \Pr(B_i^- \mid t). \tag{2.2}$$

Using Bayes’ rule further and under the assumption of constant priors, Maron showed that
optimizing $DD$ is equivalent to optimizing $\widehat{DD}$, defined as

$$\widehat{DD}(t) = \prod_{1 \le i \le n} \Pr(t \mid B_i^+) \prod_{1 \le i \le m} \Pr(t \mid B_i^-). \tag{2.3}$$

Instead of maximizing $\widehat{DD}$, a common practice is to minimize the negative log-likelihood
given by:

$$-\log \widehat{DD}(t) = -\Big( \sum_{1 \le i \le n} \log \Pr(t \mid B_i^+) + \sum_{1 \le i \le m} \log \Pr(t \mid B_i^-) \Big). \tag{2.4}$$

This formulation is more robust against very small probabilities.

To compute $\Pr(t \mid B_i)$ for a given bag $B_i$, a conjunction measure of all its instances $B_{ij}$,
$j = 1, \ldots, M$, is computed using the noisy-or operator

$$\Pr(t \mid B_i) = 1 - \prod_{1 \le j \le M} \big(1 - \Pr(B_{ij} \in t)\big), \tag{2.5}$$

where $\Pr(B_{ij} \in t)$ is computed from a Gaussian distribution centred at the concept point
$t$.

To optimize the above cost function (2.4), gradient descent can be used to find an optimal
target concept $t$. Another optimization technique that can be used to find the most likely
concept is EM-DD [58]. The basic idea behind EM-DD is to view “the knowledge of which
instance corresponds to the label of the bag as a missing attribute which can be estimated
using the Expectation Maximization (EM) approach”. EM-DD starts by taking an
initial guess from positive instances as a target concept, then alternates between two steps.
In the first step, the current concept is used to pick, from each bag, the instance that is
most likely responsible for the bag label. In the second step, a new target concept $t'$ is found
by maximizing the likelihood over all negative instances and the positive instances identified
by the first step.
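The cost function (2.4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`instance_prob`, `bag_prob`, `neg_log_dd`) are hypothetical, the Gaussian is left unnormalized and unscaled, negative bags use the complement of the noisy-or (the same form as the negative case of (2.8)), and the `eps` floor is added here only to avoid taking the logarithm of zero.

```python
import math

def instance_prob(x, t):
    """Pr(B_ij in t): unnormalized Gaussian similarity centred at concept t (assumed form)."""
    return math.exp(-sum((xi - ti) ** 2 for xi, ti in zip(x, t)))

def bag_prob(bag, t, positive):
    """Noisy-or (2.5) for a positive bag; its complement for a negative bag (assumption)."""
    none_close = math.prod(1.0 - instance_prob(x, t) for x in bag)
    return 1.0 - none_close if positive else none_close

def neg_log_dd(t, pos_bags, neg_bags, eps=1e-12):
    """Negative log-likelihood in the spirit of (2.4); eps guards against log(0)."""
    total = 0.0
    for bag in pos_bags:
        total -= math.log(max(bag_prob(bag, t, True), eps))
    for bag in neg_bags:
        total -= math.log(max(bag_prob(bag, t, False), eps))
    return total
```

A candidate concept that lies near one instance of every positive bag and far from all negative instances receives a low cost, which is what gradient descent or EM-DD would then search for.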

Once concepts are identified using DD or EM-DD, the label for an unseen bag $B_{new}$
(with $M$ instances) given a concept $t$ is estimated as follows:

$$\mathrm{Label}(B_{new} \mid t) = \max_{k} \big\{ \exp\big(-(B_{new,k} - t)^2\big) \big\}, \quad k = 1, \ldots, M. \tag{2.6}$$

• Multi-target concept Diverse Density (MDD)

The MDD is a new metric developed by Karem and Frigui [59] for the purpose of
fuzzy clustering of multiple instance data (FCMI). This approach extends the standard
Diverse Density (DD) metric established by Maron to accommodate more than
one positive target concept. The governing assumption behind this extension is that
there exist MIL problems for which a single target concept inadequately represents
the feature space.

In MDD, there are multiple target concepts $\{C_1, \ldots, C_r\}$, and each bag is assigned
memberships to multiple target concepts. This membership assignment is conducted
by selecting the concept that maximizes the noisy-or measure (2.5). Once memberships
are assigned to each bag, the target concepts are optimized separately following
the pattern of the general DD methodology. The MDD metric is given by the following:

$$MDD(T, U) = \prod_{n=1}^{N} \prod_{i=1}^{r} \big[\Pr(C_i \mid B_n)\big]^{u_{in}^{m}}. \tag{2.7}$$

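In (2.7), $U = [u_{in}]$ is a membership matrix such that each bag $B_n$ is assigned to
target concept $C_i$ with membership degree $u_{in}$, and $m$ is a fuzzifier that controls the
fuzziness of the partitions as in the FCM [60]. $\Pr(C_i \mid B_n)$ is the probability that $C_i$
is a target concept given $B_n$, and is defined as

$$\Pr(C_i \mid B_n) = \begin{cases} 1 - \prod_{k} \big(1 - \Pr(x_{nk} \in C_i)\big) & \text{if } label(B_n) = 1, \\ \prod_{k} \big(1 - \Pr(x_{nk} \in C_i)\big) & \text{if } label(B_n) = 0, \end{cases} \tag{2.8}$$

where $label(B_n)$ is the label of bag $B_n$, $x_{nk}$ is the $k$th instance of bag $B_n$, and the
products run over all instances of $B_n$. $\Pr(x_{nk} \in C_i)$ is regarded as the similarity of
instance $x_{nk}$ to target concept $C_i$, and it is computed using

$$\Pr(x_{nk} \in C_i) = \exp\Big(-\sum_{j} s_{ij} (x_{nkj} - c_{ij})^2\Big). \tag{2.9}$$

In (2.9), $x_{nkj}$ and $c_{ij}$ denote the $j$th feature of $x_{nk}$ and of $C_i$, respectively, and $s_{ij}$
is a scaling parameter that weights the role of feature $j$ in target concept $i$ [39].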


2.1.2  Multiple  Instance  Regression 


Multiple instance regression (MI-regression) was first introduced by Ray and Page
[61]. In MI-regression, bags are associated with real-valued labels instead of the usual
binary class labels (positive/negative). As in standard regression, the task of
MI-regression is to predict a real-valued bag label.

Ray and Page assumed that every bag has a primary instance responsible for the bag label.
Under this assumption, an ideal regression model is a hyperplane $Y = Xb$ such that

$$b = \arg\min_{b} \sum_{i=1}^{N} L(y_i, X_{ip}, b), \tag{2.10}$$


where $N$ is the number of bags, $y_i$ is the real-valued label of bag $B_i$, $X_{ip}$ is the primary
instance of the $i$th bag, and $L$ is a loss function defined by

$$L(y_i, X_{ij}, b) = (y_i - X_{ij} b)^2. \tag{2.11}$$

Equation (2.10) assumes that the primary instance $X_{ip}$ is known during training. However,
this is not the case for most MIL problems. To overcome this issue, Ray and Page proposed
to use the “best fit” hyperplane instead:

$$b = \arg\min_{b} \sum_{i=1}^{N} \min_{j} L(y_i, X_{ij}, b), \quad j = 1, \ldots, M_i, \tag{2.12}$$

where $M_i$ is the number of instances of the $i$th bag.

To find the optimal set of parameters $b$, Ray and Page proposed an algorithm based on
an EM approach. First, a hypothesis hyperplane $b$ is initialized. Then the algorithm alternately
iterates between two steps: (1) in the expectation step, the instance with the least $L$-error
with respect to $b$ is selected from each bag, and (2) in the maximization step, ordinary
regression is performed to find a new hyperplane that best fits the selected instances. The
process continues until convergence. The MI-regression solution is summarized in Algorithm
2.1.
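The alternating procedure can be illustrated with a small, dependency-free Python sketch. It is not Ray and Page's implementation: the function names are hypothetical, ordinary regression is done through the normal equations (which assumes the selected instances give a non-singular system), and the loop stops as soon as the total error no longer decreases.

```python
def sq_loss(y, x, b):
    """Loss (2.11): squared residual of one instance against hyperplane b."""
    return (y - sum(xi * bi for xi, bi in zip(x, b))) ** 2

def least_squares(X, y):
    """Solve the normal equations (X^T X) b = X^T y by Gauss-Jordan elimination."""
    d = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(d)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(d)]
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(d):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    return [A[i][d] / A[i][i] for i in range(d)]

def mi_regression(bags, labels, b, max_iter=100):
    """Alternate instance selection and refitting, as described for (2.12)."""
    best = float("inf")
    for _ in range(max_iter):
        # Selection step: the lowest-loss instance of each bag under the current b.
        chosen = [min(bag, key=lambda x: sq_loss(y, x, b))
                  for bag, y in zip(bags, labels)]
        error = sum(sq_loss(y, x, b) for x, y in zip(chosen, labels))
        if error >= best:  # no further improvement: stop
            break
        best = error
        b = least_squares(chosen, labels)  # refit on the selected instances
    return b
```

Like any alternating scheme of this kind, the result depends on the initial hyperplane, so several restarts may be needed in practice.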

2.1.3  Multiple  Instance  Learning  via  Embedded  Instance  Selection  (MILES) 

MILES was proposed by Chen et al. [43]. The framework converts the MIL problem
into a standard supervised learning problem by mapping each bag into a feature space
defined by the similarity between its instances and a set of target concepts. Formally, for
a given bag $B_i$ of instances $X_{ij}$, $j = 1, \ldots, M_i$, the similarity to a given target concept $t_k$,
$k = 1, \ldots, C$ ($C$ is the number of target concepts), is given by:

$$s(t_k, B_i) = \max_{j} \Big\{ \exp\Big(-\frac{\|X_{ij} - t_k\|^2}{\sigma^2}\Big) \Big\}, \tag{2.13}$$

where $\sigma$ is a scaling factor.

Using (2.13), a bag is mapped into the space induced by the similarity values, i.e., a bag is
represented by the coordinates $m(B_i)$ as follows:

$$m(B_i) = [s(t_1, B_i), s(t_2, B_i), \ldots, s(t_C, B_i)]. \tag{2.14}$$

Algorithm 2.1 Multiple-Instance Regression Algorithm

Inputs:  B: the set of training bags.
         T: the set of training labels.
         b: random initial hyperplane.
         M_i: the number of instances in bag i.
         N: the number of training bags.
Outputs: A hyperplane Y = Xb.

E = ∞
Done = false
repeat
    I = ∅
    Error = 0
    for each bag B_i do
        for each instance X_ij in B_i do
            L(y_i, X_ij, b) = (y_i - X_ij b)^2   [Calculate the error of the instance with respect to the hyperplane]
        end for
        I = I ∪ {the instance with the lowest error}; let this error be L_min
        Error = Error + L_min
    end for
    if Error ≥ E then
        Done = true
    else
        E = Error
        solve (2.12) for b
    end if
until Done
return b.


Considering a binary MIL classification problem, with bag labels of +1 and -1,
MILES uses 1-Norm SVM [62] to learn a linear classifier on the mapped space, i.e.,

$$y_i = \mathrm{sign}\Big(\sum_{k=1}^{C} w_k \, s(t_k, B_i) + b\Big), \tag{2.15}$$

where $w_k$ is a weight associated with $s(t_k, B_i)$ and $b$ is a bias parameter.
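The bag embedding and the linear decision rule can be sketched as follows. The function names are hypothetical, the similarity is assumed to be a Gaussian of the squared Euclidean distance scaled by $\sigma^2$, and the weights and bias are taken as already learned (the 1-Norm SVM training itself is not shown).

```python
import math

def similarity(t, bag, sigma):
    """s(t, B): the instance of the bag closest to concept t determines the similarity."""
    return max(math.exp(-sum((x - c) ** 2 for x, c in zip(inst, t)) / sigma ** 2)
               for inst in bag)

def embed(bag, concepts, sigma=1.0):
    """m(B) as in (2.14): one similarity coordinate per target concept."""
    return [similarity(t, bag, sigma) for t in concepts]

def predict(bag, concepts, w, b, sigma=1.0):
    """Linear rule in the spirit of (2.15); w and b are assumed to be given."""
    score = sum(wk * sk for wk, sk in zip(w, embed(bag, concepts, sigma))) + b
    return 1 if score >= 0 else -1
```

Because every bag becomes a fixed-length vector, any standard supervised classifier could be trained on the embedded bags.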

2.1.4  Multiple  Instance  Neural  Networks 

In this approach, Zhou and Zhang [63] proposed to adapt the BackPropagation algorithm
[64] for multiple instance learning by employing a modified loss function. For a
given neural network of one or more hidden layers, a bag $B_i$ of instances $X_{ij}$, $j = 1, \ldots, M_i$,
is fed to the network one instance at a time, and for a given instance a partial network error
$E_{ij}$ is computed as follows:

$$E_{ij} = \begin{cases} 0 & \text{if } (B_i \text{ is positive}) \text{ and } (0.5 \le O_{ij}), \\ 0 & \text{if } (B_i \text{ is negative}) \text{ and } (O_{ij} < 0.5), \\ \frac{1}{2}(O_{ij} - 0.5)^2 & \text{otherwise}, \end{cases} \tag{2.16}$$

where $O_{ij}$ is the network’s computed output when presented with instance $X_{ij}$.

Given (2.16), the overall network error, $E_i$, for a given bag $B_i$ is computed using the
following heuristic:

$$E_i = \begin{cases} \min_{1 \le j \le M_i} E_{ij} & \text{if } (B_i \text{ is positive}), \\ \max_{1 \le j \le M_i} E_{ij} & \text{if } (B_i \text{ is negative}). \end{cases} \tag{2.17}$$

Using the error defined in (2.17), and given the neural network architecture, it is straightforward
to derive a backpropagation update rule for the network’s weights. To speed up the
training process, Zhou suggested that when the partial error, $E_{ij}$, for a given instance $X_{ij}$ of
bag $B_i$ is equal to zero, the rest of the instances should be skipped and not fed to the network.
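The two error definitions can be written down directly. The sketch below is only an illustration with hypothetical function names; it takes the network outputs as plain numbers in $[0, 1]$, assumes the factor $\frac{1}{2}$ in the instance error, and leaves the network and its weight updates out.

```python
def instance_error(output, bag_positive):
    """Partial error E_ij of (2.16) for one network output."""
    if bag_positive and output >= 0.5:
        return 0.0
    if not bag_positive and output < 0.5:
        return 0.0
    return 0.5 * (output - 0.5) ** 2

def bag_error(outputs, bag_positive):
    """Bag error E_i of (2.17): min over instances if positive, max if negative."""
    errors = [instance_error(o, bag_positive) for o in outputs]
    return min(errors) if bag_positive else max(errors)
```

For a positive bag a single confidently positive instance drives the bag error to zero, whereas a negative bag is penalized by its worst instance.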

Even though this solution of MIL is supposed to extend neural networks to reason
with bags, it still relies on computing with single instances. Another multiple instance
neural network approach was proposed by Ramon and De Raedt [1]. In this work, the authors
proposed a neural network architecture composed of two stages:

1. A first stage composed of an ensemble of subnetworks (multilayered perceptrons),
$\{Net_j\}_{j=1}^{M_i}$, whose count equals the number of instances of the input bag $B_i$. All
subnetworks of the first stage are identical and share the same weights (hence, they also
share the same weight update).

2. A second stage that aggregates the outputs $\{O_j\}_{j=1}^{M_i}$ of all subnetworks. For the
purpose of aggregation, this stage uses a differentiable version of the “max” function,
dmax, defined as:

$$dmax_{\alpha}(O_1, O_2, \ldots, O_{M_i}) = \frac{1}{\alpha} \ln\Big(\sum_{j=1}^{M_i} e^{\alpha O_j}\Big), \tag{2.18}$$

where $\alpha$ is a real-valued parameter that controls the accuracy of the max function
approximation.

To optimize the weights of the network, the authors derived update equations using the commonly
used BackPropagation algorithm [64]. A graphical representation of a multiple instance
neural network is shown in Figure 2.2.


Figure 2.2: Illustration of Ramon and De Raedt’s multiple instance neural network [1], showing stage 1 and stage 2.
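The behaviour of the differentiable max can be checked numerically with a short sketch. The function name mirrors the text; subtracting the largest output before exponentiating is an implementation choice added here for numerical stability and does not change the value. A positive $\alpha$ is assumed.

```python
import math

def dmax(outputs, alpha):
    """Smooth maximum: (1 / alpha) * ln(sum_j exp(alpha * O_j)), for alpha > 0."""
    m = max(outputs)  # shift for numerical stability; the result is unchanged
    return m + math.log(sum(math.exp(alpha * (o - m)) for o in outputs)) / alpha
```

The value always lies between $\max_j O_j$ and $\max_j O_j + \ln(M_i)/\alpha$, so larger values of $\alpha$ give a tighter approximation of the true maximum.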


2.1.5  Multiple  Instance  RBF  Neural  Networks 

Multiple Instance RBF (Radial Basis Function) Neural Networks (RBF-MIP) is an
adaptation of the standard RBF neural network [65] for the problem of multiple instance
learning. This approach was introduced by Zhou and Zhang [66]. Similarly to the standard
RBF network, the RBF-MIP is composed of two layers. However, as opposed to the standard
RBF neural networks where the first layer’s nodes are prototype vectors indicating the
centers of basis functions, the first layer of RBF-MIP corresponds to clusters of training
bags, i.e., each input node of RBF-MIP is a cluster $C_k$, $k = 1, \ldots, K$, of training bags. The
second layer of the RBF-MIP network is the same as the standard RBF neural network. A
graphical representation of a typical RBF-MIP neural network is shown in Figure 2.3.

Figure 2.3: Illustration of an RBF-MIP neural network with a single output (Layer 1, Layer 2).


In the first layer of a given RBF-MIP network, the clustering of bags is achieved
by merging training bags agglomeratively using the Hausdorff metric to measure distances
between bags and between clusters [67]. Formally, given two bags $B_1$ and $B_2$ of instances
$\{X_{1j}\}_{j=1}^{M_1}$ and $\{X_{2j}\}_{j=1}^{M_2}$, respectively, the Hausdorff metric between $B_1$ and $B_2$ is defined as

$$H(B_1, B_2) = \min_{X_{1j} \in B_1,\, X_{2i} \in B_2} \{dist(X_{1j}, X_{2i})\}, \tag{2.19}$$

where $dist$ is a distance measure on the instance space (e.g., Euclidean distance). To compute
the Hausdorff metric between a bag and a cluster of bags, the instances from all bags
in the cluster are first merged into a new bag, and then (2.19) is used to compute the metric. The
clustering process is summarized in Algorithm 2.2.

Algorithm 2.2 First Layer’s Clustering Algorithm of RBF-MIP

Inputs:  B: the set of training bags.
         N: the number of training bags.
         K: number of remaining clusters in the first layer.
         H: Hausdorff metric.
Outputs: {C_k}_{k=1}^{K}: clusters of training bags.

Begin with one cluster per training bag (C_1 = {B_1}, ..., C_N = {B_N})
while there are more than K clusters do
    Merge the two clusters C_i, C_j which minimize H(C_i, C_j)
end while
return {C_k}_{k=1}^{K}.

For a given input bag $B_i$, the first layer’s outputs are computed as follows:

$$O_k = \begin{cases} \exp\Big(-\dfrac{H(B_i, C_k)^2}{2\sigma_k^2}\Big) & \text{if } k \neq 0, \\ 1 & \text{if } k = 0, \end{cases} \tag{2.20}$$

where $\sigma_k$ is a standard deviation parameter whose value controls the smoothness of the $k$th
input node function. $\sigma_k$ is fixed to a value $\sigma$ that is the same for all input nodes
and is computed by taking the average distance between every pair of clusters, i.e.,

$$\sigma = \mu \times \frac{\sum_{i=1}^{K} \sum_{j=i+1}^{K} H(C_i, C_j)}{K(K-1)/2}. \tag{2.21}$$

In (2.21), $\mu$ is a scaling factor.

To optimize the weights of the second layer of the RBF-MIP neural network, a sum-of-squared-error
loss function is minimized, similarly to the standard RBF networks [65].
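The first layer of RBF-MIP can be prototyped in a few functions. This is a sketch under assumptions rather than the authors' code: the function names are hypothetical, a cluster is stored as the merged list of its instances, the bag distance is the minimum pairwise Euclidean distance of (2.19), and the node output is assumed to be a Gaussian with denominator $2\sigma^2$.

```python
import math

def min_hausdorff(bag_a, bag_b):
    """H in (2.19): smallest Euclidean distance between any two instances."""
    return min(math.dist(x, y) for x in bag_a for y in bag_b)

def cluster_bags(bags, k):
    """Agglomerative merging as in Algorithm 2.2; each cluster is a merged instance list."""
    clusters = [list(b) for b in bags]
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: min_hausdorff(clusters[p[0]], clusters[p[1]]))
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters

def shared_sigma(clusters, mu=1.0):
    """(2.21): mu times the average pairwise distance between clusters."""
    k = len(clusters)
    total = sum(min_hausdorff(clusters[i], clusters[j])
                for i in range(k) for j in range(i + 1, k))
    return mu * total / (k * (k - 1) / 2)

def first_layer(bag, clusters, sigma):
    """Bias node (k = 0) followed by one Gaussian node per cluster (assumed form)."""
    return [1.0] + [math.exp(-min_hausdorff(bag, c) ** 2 / (2 * sigma ** 2))
                    for c in clusters]
```

The exhaustive pair search is quadratic in the number of clusters at every merge, which is acceptable for a small illustration but would need caching of distances for large training sets.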


2.1.6  Citation  K-Nearest  Neighbors 

In the standard K-NN classifier (K-Nearest Neighbors), to classify a given instance, the
“K” nearest instances are retrieved using a distance measure on the instance space (e.g.,
Euclidean distance), then an output label is computed from the labels of the “K” nearest
instances. Using the same approach, Wang and Zucker [67] adapted K-NN for the case of
multiple instances. To determine the nearest neighbors for a given bag, the Hausdorff metric
(defined in (2.19)) is used instead of the Euclidean distance. Then the K-NN algorithm
can be applied directly. Wang and Zucker found that the majority vote method, used by
standard K-NN, often produced sub-optimal results in the multiple instance setting [68]. To
improve the multiple instance K-NN, they proposed a variation called Citation-KNN [68].
Citation-KNN is motivated by the notion of citation from library and information science.
Under this view, the authors defined a “C-nearest citers” measure for a given bag. This
measure is defined as follows:




• For two given bags, $B$ and $B'$, let $Rank(B', B)$ equal $n$ if $B$ is the $n$th nearest
neighbor of $B'$.

• Then, the C-nearest citers of $B$ are the $C$ bags that return the lowest neighbor ranking
for $B$, i.e.,

$$Citers(B, C) = \{B_i \mid Rank(B_i, B) \le C,\ B_i \in \mathcal{B}\}, \tag{2.22}$$

where $\mathcal{B}$ is the set of all training bags.

The  decision  of  Citation-KNN  relies  on  the  K-nearest  bags  as  well  as  the  C-nearest 
citers.  Specifically,  a  bag  is  classified  as  positive  if  and  only  if  there  are  strictly  more  positive 
bags  than  negative  bags  in  the  combined  K-nearest  bags  and  C-nearest  citers.  C  is  usually 
set  to  K+2. 
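The voting rule can be made concrete with a short sketch. The names below are hypothetical, binary labels are assumed to be coded as 1 (positive) and 0 (negative), bags are compared with the minimum pairwise distance of (2.19), and citers are taken to be the training bags that rank the query among their $C$ nearest neighbors, following (2.22).

```python
import math

def min_hausdorff(a, b):
    """Minimum pairwise Euclidean distance between two bags, as in (2.19)."""
    return min(math.dist(x, y) for x in a for y in b)

def rank(src, target, bags):
    """Rank(src, target): position of target among the neighbours of src (1 = nearest)."""
    others = sorted((b for b in bags if b is not src),
                    key=lambda b: min_hausdorff(src, b))
    return next(i for i, b in enumerate(others, start=1) if b is target)

def citation_knn(query, train_bags, labels, k, c):
    """Return True (positive) iff positives strictly outnumber negatives among
    the k nearest references and the citers of the query."""
    order = sorted(range(len(train_bags)),
                   key=lambda i: min_hausdorff(query, train_bags[i]))
    references = order[:k]
    pool = train_bags + [query]  # the query competes with training bags for rank
    citers = [i for i, b in enumerate(train_bags) if rank(b, query, pool) <= c]
    votes = [labels[i] for i in references + citers]
    return votes.count(1) > votes.count(0)
```

A bag that is both a reference and a citer votes twice in this sketch, which matches counting over the combined list of references and citers.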

2.2  Fuzzy  Logic 

Research on fuzzy set theory goes back to 1965 [69]. The first main development
started with Zadeh [69] introducing fuzzy sets to extend classical set theory, offering
an intuitive approach to model and manipulate data with imprecision and uncertainty. A few
years later, fuzzy logic was introduced by the same author [4]. Fuzzy logic is a computational
framework that makes use of fuzzy set theory and fuzzy assignment of elements to sets. In
classical set theory, also known as crisp sets, an element is either a member of a set or
not, whereas in fuzzy set theory, an element is characterized by a degree of membership,
usually a real number between 0 and 1. Fuzzy logic, in contrast to traditional two-valued
(Boolean) logic, uses the elements’ membership degrees to evaluate the degree of truth of
logical propositions. Hence, the degree of truth is non-crisp, or soft. This enables fuzzy
logic to be characterized by linguistic terms rather than by numbers. For example, in fuzzy
logic, a fuzzy proposition can have the following expression: “The temperature is high”, in
which the linguistic term “high” is a fuzzy set that describes the temperature. Fuzzy logic
simulates human imprecise understanding of the world, and can be viewed as a framework
for computing with words [46].

2.2.1 Fuzzy Sets

A fuzzy set expresses the degree to which an element belongs to a set. It has a
characteristic function that describes the membership degree of an element in the set and
takes values between 0 and 1.

Let $X$ represent a collection of objects, referred to as the universe of discourse. Formally, a
fuzzy set $A$ in $X$ is defined as:

$$A = \{(x, \mu_A(x)) \mid x \in X\}, \tag{2.23}$$

where $\mu_A(x)$ is called the membership function (MF) for fuzzy set $A$. The MF maps every
element of $X$ to a membership degree, $\mu_A(x) \in [0, 1]$.

The difference between a crisp set and a fuzzy set is that the MF is allowed to take any value
in the interval $[0, 1]$ rather than only the values $\{0, 1\}$. A simple interpretation of the degree
of membership is given by:

• $\mu_A(x) = 1$ if $x$ is totally in $A$

• $\mu_A(x) = 0$ if $x$ is not in $A$

• $0 < \mu_A(x) < 1$ if $x$ is partly in $A$

To  clarify  this  definition,  let  us  consider  the  subjective  example  of  a  person’s  age.  Clearly, 
there  is  no  crisp  boundary  beyond  which  a  person  can  be  considered  “young”  or  not.  If  we 
model  this  statement  by  means  of  a  crisp  set,  we  need  to  use  an  expression  of  the  following 
form: 

$$Young = \{x \mid age(x) \le 25,\ x \in X\}, \tag{2.24}$$

where  X  is  the  set  of  all  people.  An  illustration  of  the  crisp  membership  function  of  this 
example  is  shown  in  Figure  2.4.  It  should  be  clear  that  this  crisp  representation  is  not 
appropriate  to  model  the  concept  of  age.  In  fact,  using  this  representation,  a  person  who 
is  24.9  years  old  is  considered  young,  while  a  person  who  is  25.1  years  old  is  not  young. 

A fuzzy set representation, however, does not define any hard boundaries. Instead,
it gradually assigns older people to the fuzzy set Young in an ordered manner. It can be
described by:

$$Young = \{(x, \mu_{Young}(x)) \mid x \in X\}, \tag{2.25}$$

where $\mu_{Young}$ is the membership function of the fuzzy set Young, and is illustrated in Figure
2.5. In Figure 2.5, people of age between 0 and 25 are considered young, whereas people
older than 40 are not considered young. Between the ages of 25 and 40, the membership
degree gradually decreases to 0. This representation is close to the way humans perceive
the statement “a person is young”.

Figure 2.4: An illustration of the crisp membership function “Young” (horizontal axis: age, with the threshold at 25; vertical axis: $\mu_{Young}$)

Figure 2.5: An illustration of the fuzzy membership function “Young”
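The contrast between the two representations of “Young” can be reproduced with two small functions. The breakpoints 25 and 40 come from the text; the linear decrease between them is an assumption made for this sketch, since the exact shape of the curve in Figure 2.5 is not stated.

```python
def young_crisp(age):
    """Crisp set: hard threshold at 25."""
    return 1.0 if age <= 25 else 0.0

def young_fuzzy(age):
    """Fuzzy set: full membership up to 25, none from 40 on,
    and an assumed linear decrease in between."""
    if age <= 25:
        return 1.0
    if age >= 40:
        return 0.0
    return (40 - age) / 15
```

With these definitions a person aged 25.1 drops from membership 1 to 0 in the crisp set, but keeps a membership of about 0.99 in the fuzzy set.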

The construction of a fuzzy set depends on two main factors: the identification
of a suitable universe of discourse, and the specification of an appropriate membership
function [70]. In some applications, such as control, the fuzzy sets are typically designed
by experts using domain knowledge. For other applications, such as pattern recognition,
fuzzy sets can be learned from training data. In this case, the membership functions are
parameterized functions and training data is used to learn the optimal set of parameters
that best fit the data. Some of the common parameterized MFs include triangular MF,
trapezoidal MF, Gaussian MF, and the generalized bell MF. The shapes and parameters of
these MFs are shown in Table 2.1.

TABLE 2.1

Shape and parameters of commonly used parameterized MFs

| MF | Equation for $\mu(x)$ | Parameters | Shape |
|---|---|---|---|
| Triangular MF | $\max\big(\min\big(\frac{x-a}{b-a}, \frac{c-x}{c-b}\big), 0\big)$ | $a, b, c$ | triangle of height 1 |
| Trapezoidal MF | $\max\big(\min\big(\frac{x-a}{b-a}, 1, \frac{d-x}{d-c}\big), 0\big)$ | $a, b, c, d$ | trapezoid of height 1 with corners at $a, b, c, d$ |
| Gaussian MF | $\exp\big(-\frac{(x-c)^2}{2\sigma^2}\big)$ | $c, \sigma$ | bell curve of height 1 centered at $c$ |
| Generalized bell MF | $\dfrac{1}{1 + \left|\frac{x-c}{a}\right|^{2b}}$ | $a, b, c$ | flat-topped bell of height 1 centered at $c$, with crossover points at $c-a$ and $c+a$ |
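The four parameterized membership functions can be written compactly in Python. The sketch uses their standard textbook forms; to avoid any ambiguity about parameter order, the triangular function here takes descriptive arguments (`left`, `peak`, `right`), which are names chosen for this example only.

```python
import math

def triangular(x, left, peak, right):
    """Triangle rising from `left` to 1 at `peak`, then falling to 0 at `right`."""
    return max(min((x - left) / (peak - left), (right - x) / (right - peak)), 0.0)

def trapezoidal(x, a, b, c, d):
    """Trapezoid with feet at a and d and a plateau of height 1 on [b, c]."""
    return max(min((x - a) / (b - a), 1.0, (d - x) / (d - c)), 0.0)

def gaussian(x, c, sigma):
    """Gaussian MF centered at c with width sigma."""
    return math.exp(-((x - c) ** 2) / (2 * sigma ** 2))

def generalized_bell(x, a, b, c):
    """Generalized bell MF centered at c; membership is 0.5 at c - a and c + a."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))
```

When such functions are learned from data, the arguments other than `x` are the parameters being optimized.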

A fuzzy set is uniquely identified through its membership function. The α-cut of
fuzzy set $A$, $A_\alpha$, is usually used to describe membership functions in more detail. $A_\alpha$ is
defined as:

$$A_\alpha = \{x \in X \mid \mu_A(x) \ge \alpha\}, \tag{2.26}$$

where $\alpha \in [0, 1]$. $A_1$ is called the core of $A$, and the strong cut at $\alpha = 0$,
$\{x \in X \mid \mu_A(x) > 0\}$, is called the support of $A$. Figure 2.6
illustrates the α-cut, core, and support of a bell-shaped membership function.

Figure 2.6: Illustration of α-cut, core, and support of a bell-shaped membership function

As in classical crisp sets, the most basic operations for fuzzy sets are union, intersection,
and complement. In the following, let $A$ and $B$ be two fuzzy sets with membership
functions $\mu_A(x)$ and $\mu_B(x)$.

• The union of two fuzzy sets $A$ and $B$, often called “join”, is a fuzzy set $C$ characterized
by a membership function $\mu_C$ defined as:

$$\mu_C(x) = \mu_{A \vee B}(x) = \max(\mu_A(x), \mu_B(x)). \tag{2.27}$$

• The intersection of two fuzzy sets $A$ and $B$, also known as “meet”, is a fuzzy set $C$
characterized by a membership function $\mu_C$ defined as:

$$\mu_C(x) = \mu_{A \wedge B}(x) = \min(\mu_A(x), \mu_B(x)). \tag{2.28}$$

• The complement of fuzzy set $A$, denoted by $\neg A$, is defined as:

$$\mu_{\neg A}(x) = 1 - \mu_A(x). \tag{2.29}$$

The  physical  interpretation  of  the  above  fuzzy  set  operators  relates  to  the  linguistic  concepts 
of  OR,  AND,  and  NOT.  For  instance,  if  the  fuzzy  sets  A  and  B  describe  the  Youthfulness 
and  Tallness  of  a  person  respectively,  then  applying  the  set  operators  leads  to  the  following 
statements: 

• $\mu_{A \vee B}(x)$ = the degree to which $x$ is either “Young” or “Tall”;

• $\mu_{A \wedge B}(x)$ = the degree to which $x$ is both “Young” and “Tall”;

• $\mu_{\neg A}(x)$ = the degree to which $x$ is not “Young”.

We  should  emphasize  here  that,  in  addition  to  the  definitions  in  (2.27),  (2.28),  and  (2.29), 
there  are  multiple  ways  to  define  fuzzy  union,  intersection,  and  complement.  Most  of  the 




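TABLE 2.2

Most frequently used T-norm and T-conorm operators

| Category | Operator | Definition |
|---|---|---|
| T-norms | Minimum | $T_{min}(a, b) = \min(a, b) = a \wedge b$ |
| T-norms | Algebraic product | $T_{ap}(a, b) = ab$ |
| T-norms | Bounded product | $T_{bp}(a, b) = 0 \vee (a + b - 1)$ |
| T-norms | Drastic product | $T_{dp}(a, b) = a$ if $b = 1$; $b$ if $a = 1$; $0$ if $a, b < 1$ |
| T-conorms | Maximum | $T_{cmax}(a, b) = \max(a, b) = a \vee b$ |
| T-conorms | Algebraic sum | $T_{cas}(a, b) = a + b - ab$ |
| T-conorms | Bounded sum | $T_{cbs}(a, b) = 1 \wedge (a + b)$ |
| T-conorms | Drastic sum | $T_{cds}(a, b) = a$ if $b = 0$; $b$ if $a = 0$; $1$ if $a, b > 0$ |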

operators, except complement, fall under two categories. The first one, called T-norms,
is a class of fuzzy intersection operators [70] suitable to carry out intersection, Cartesian product,
and, as we will see later, fuzzy implication. The second category of operators, called
T-conorms, is a class of fuzzy union operators [70] suitable to carry out union and other aggregation
operations.

The most frequently used T-norm operators include Minimum, Algebraic product, Bounded
product, and Drastic product. Similarly, T-conorm operators include Maximum, Algebraic
sum, Bounded sum, and Drastic sum. These operators are defined in Table 2.2.
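The eight operators can be expressed as short Python functions, which also makes two well-known properties easy to check: the ordering drastic ≤ bounded ≤ algebraic ≤ minimum among the T-norms, and the duality $S(a, b) = 1 - T(1 - a, 1 - b)$ between each T-norm and its T-conorm. The function names are chosen for this sketch, and the drastic sum returns 1 when both arguments are positive, which is its standard definition.

```python
def t_min(a, b):
    return min(a, b)

def t_product(a, b):
    return a * b

def t_bounded(a, b):
    return max(0.0, a + b - 1.0)

def t_drastic(a, b):
    if b == 1.0:
        return a
    if a == 1.0:
        return b
    return 0.0

def s_max(a, b):
    return max(a, b)

def s_algebraic(a, b):
    return a + b - a * b

def s_bounded(a, b):
    return min(1.0, a + b)

def s_drastic(a, b):
    if b == 0.0:
        return a
    if a == 0.0:
        return b
    return 1.0  # standard definition when both arguments are positive

# Each T-norm paired with its dual T-conorm under the complement 1 - x.
DUAL_PAIRS = [(t_min, s_max), (t_product, s_algebraic),
              (t_bounded, s_bounded), (t_drastic, s_drastic)]
```

Any of these T-norms can be substituted for “min” wherever a fuzzy intersection is needed, at the cost of a more or less strict conjunction.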

In addition to modeling union, intersection, and complement, fuzzy set theory embeds
mechanisms to model compensatory operations, i.e., aggregation operators where a
high value in matching one criterion can compensate to some extent for a low value for
another criterion. In the following we list five examples of such operators.

1. Generalized mean: Let $a_1, a_2, \ldots, a_n$ be the degrees of satisfaction of $n$ criteria.
The generalized mean is defined as:

$$h_\alpha(a_1, a_2, \ldots, a_n) = \Big(\frac{1}{n} \sum_{i=1}^{n} a_i^{\alpha}\Big)^{1/\alpha}, \tag{2.30}$$

where $\alpha$ is a fixed real number. For $\alpha = 1$, $h_\alpha$ implements the arithmetic average.
Similarly, when $\alpha = -1$, $h_\alpha$ is the harmonic average, and as $\alpha$ approaches 0, the
generalized mean converges to the geometric mean. All instantiations of the generalized
mean produce values between the minimum and maximum of the degrees of
satisfaction of the individual criteria.


2. Fuzzy hybrid operators: Fuzzy hybrid operators combine different types of fuzzy
set operators into a single equation.

• Arithmetic hybrid operators

$$A \oplus_\gamma B = (1 - \gamma)(A \cap B) + \gamma (A \cup B), \tag{2.31}$$

• Multiplicative hybrid operators

$$A \otimes_\gamma B = (A \cap B)^{1-\gamma} (A \cup B)^{\gamma}. \tag{2.32}$$

In (2.31) and (2.32), $\gamma \in [0, 1]$ controls the amount of “mixing” of the union and
intersection components.


3. Zimmermann hybrid operator: a hybrid operator for multi-criteria aggregation
that was modeled after the compensatory nature of human aggregation. For
$a_1, a_2, \ldots, a_n$ degrees of satisfaction of $n$ criteria, the Zimmermann hybrid operator is
defined as:

$$h_\gamma(a_1, a_2, \ldots, a_n) = \Big(\prod_{i=1}^{n} a_i^{\delta_i}\Big)^{1-\gamma} \Big(1 - \prod_{i=1}^{n} (1 - a_i)^{\delta_i}\Big)^{\gamma}, \tag{2.33}$$

where $\gamma \in [0, 1]$ is a mixing coefficient, and $\delta_i$ are weights associated with each criterion
$a_i$, such that $\sum_{i=1}^{n} \delta_i = n$.

4. Ordered Weighted Averaging Operator (OWA) [71]: Let $\{a_1, a_2, \ldots, a_n\}$ be
$n$ degrees of satisfaction of a given criteria, and $w = (w_1, w_2, \ldots, w_n)^T$ be a weight
vector such that $w_i \in [0, 1]$ and $\sum_{i=1}^{n} w_i = 1$. Let $a_{(j)}$ denote the $a_i$ sorted from the
largest degree of satisfaction to the smallest (i.e., $a_{(1)} = \max\{a_1, a_2, \ldots, a_n\}$). The
OWA operator is a mapping function, $OWA: R^n \to R$, defined as:

$$OWA(a_1, a_2, \ldots, a_n) = \sum_{j=1}^{n} w_j a_{(j)}. \tag{2.34}$$

5. Ordered Weighted Geometric Averaging Operator (OWGA) [72]: OWGA
is of similar form to OWA and is defined as:

$$OWGA(a_1, a_2, \ldots, a_n) = \prod_{j=1}^{n} a_{(j)}^{w_j}. \tag{2.35}$$
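The compensatory operators listed above can be compared numerically with a small sketch. The function names are chosen for this example, each function follows the standard form of the corresponding operator, and no validation of the weights is performed (the generalized mean also assumes a nonzero $\alpha$ and positive inputs when $\alpha < 0$).

```python
import math

def generalized_mean(a, alpha):
    """Generalized mean: ((1/n) * sum a_i^alpha)^(1/alpha), alpha != 0."""
    return (sum(x ** alpha for x in a) / len(a)) ** (1.0 / alpha)

def zimmermann(a, delta, gamma):
    """Zimmermann operator: weighted product blended with its dual by gamma."""
    conj = math.prod(x ** d for x, d in zip(a, delta))
    disj = 1.0 - math.prod((1.0 - x) ** d for x, d in zip(a, delta))
    return conj ** (1.0 - gamma) * disj ** gamma

def owa(a, w):
    """OWA: weights are applied to the values sorted in decreasing order."""
    ordered = sorted(a, reverse=True)
    return sum(wj * aj for wj, aj in zip(w, ordered))

def owga(a, w):
    """OWGA: geometric counterpart of OWA on the same decreasing order."""
    ordered = sorted(a, reverse=True)
    return math.prod(aj ** wj for wj, aj in zip(w, ordered))
```

Placing all the OWA weight on the first position recovers the maximum, placing it on the last position recovers the minimum, and uniform weights give the arithmetic mean.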

2.2.2  Fuzzy  Propositions 

In fuzzy logic, a fuzzy proposition is defined as

$$p: X \text{ is } A \tag{2.36}$$

where $X$ receives values $x$ from a universal set $U$ and $A$ is a fuzzy set on $U$. For example,
a proposition can be “temperature is high” or “patch is blue”. Each fuzzy proposition
has a degree of truth $T(p)$ that is the membership degree of $X = x$ in $A$, denoted by $\mu_A(x)$.

2.2.3  Fuzzy  If-Then  Rules 

A  fuzzy  if-then  rule  is  expressed  as 

if  x  is  A  then  y  is  B  (2.37) 

where $A$ and $B$ are fuzzy sets on universes of discourse $X$ and $Y$, respectively. The phrase
“x is A” is often called the premise (or antecedent), and the phrase “y is B” is called the
consequence. Fuzzy rules can have multiple antecedents and multiple consequences connected
with fuzzy operators. Examples of fuzzy if-then rules include:

• If a person is young then income level is low.

• If temperature is high then turn AC on.

• If time is day and sky is blue then weather is good (notice the multiple antecedents).

• If pressure is high then volume is small and temperature is high (notice the multiple
consequences).

Interpreting a fuzzy if-then rule involves two main steps. First, the antecedent part of the
rule is evaluated. This involves fuzzifying the input. The second step consists of applying
the results of the antecedent expression to the consequence using fuzzy implication [73]. In
particular, as defined by Ramot et al. [74], a rule represents a fuzzy implication relation between
unconditional fuzzy propositions $p$ and $q$, where proposition $p$ is the phrase “x is A”,
and $q$ is the phrase “y is B”. For instance, rule (2.37) combines the fuzzy propositions ($p$,
$q$) into a logical implication denoted by $p \to q$, which is sometimes abbreviated as $A \to B$.
The implication is in essence a fuzzy relation $R$ between $p$ and $q$ on the product space
$X \times Y$. Formally (see footnote 1),

$$R = A \to B = A \times B = \int_{X \times Y} \mu_A(x) \star \mu_B(y) / (x, y) \tag{2.38}$$

where $\star$ is a T-norm operator and $A \times B$ is used to represent the fuzzy relation $R$. $R$
has a membership function denoted $\mu_{A \to B}(x, y)$ that represents the degree of truth of the
implication $p \to q$ when $X = x$ and $Y = y$. In the literature, the most commonly used
implication operators are the T-norms “min” and “algebraic product”. In this case, (2.38)
can be written as:

$$\mu_{A \to B}(x, y) = \int_{X \times Y} \mu_A(x) \wedge \mu_B(y) / (x, y) = \min[\mu_A(x), \mu_B(y)] \tag{2.39}$$

or,

$$\mu_{A \to B}(x, y) = \int_{X \times Y} \mu_A(x) \cdot \mu_B(y) / (x, y) = \mu_A(x) \cdot \mu_B(y) \tag{2.40}$$


2.2.4  Fuzzy  Reasoning 

Fuzzy Reasoning is “an inference procedure that derives conclusions from a set of
fuzzy if-then rules and known facts” [2,70]. The inference is carried out using the Generalized
Modus Ponens rule [4,70], which is given by the following scheme:

premise:      if x is A then y is B
fact:         x is A'
consequence:  y is B'

Footnote 1: The notation $\int_X \mu(x)/x$ stands for the union of membership grades, and “/” stands for a marker and
does not imply division.

The premise part is a fuzzy rule as defined in (2.37); $A$ and $B$ are fuzzy sets on the universes
of discourse $X$ and $Y$. The fact is a fuzzy proposition, and $A'$ is in turn a fuzzy set on $X$. The
consequence part $B'$ can be derived using the compositional rule of inference introduced by
Zadeh in 1973 [4]. $B'$ is determined as a composition of the fact and the fuzzy implication
operator. Specifically,

$$B' = A' \circ (A \to B) \tag{2.41}$$

or, equivalently,

$$\mu_{B'}(y) = \max_x\big(\min[\mu_{A'}(x), \mu_{A \to B}(x, y)]\big). \tag{2.42}$$

Using (2.39), (2.42) can be rewritten as

$$\mu_{B'}(y) = \max_x\big(\min[\mu_{A'}(x), \min[\mu_A(x), \mu_B(y)]]\big). \tag{2.43}$$

Because $\mu_B(y)$ does not depend on $x$, it can be moved outside the maximization, and further
simplification of (2.43) yields:

$$\mu_{B'}(y) = \min\big(\max_x(\min[\mu_{A'}(x), \mu_A(x)]), \mu_B(y)\big). \tag{2.44}$$

The quantity $\max_x(\min[\mu_{A'}(x), \mu_A(x)])$ is known in the literature as the rule firing strength.

To  summarize,  fuzzy  reasoning  involves  the  following  3  main  steps: 

1. Start by computing the proposition degree of truth, i.e. evaluate $\min[\mu_{A'}(x), \mu_A(x)]$;

2. Compute the rule firing strength, or as pointed out by Jang [70], the degree of belief for
the antecedent part;

3.  Compute  the  degree  of  belief  of  the  consequent  part  by  applying  the  “min”  operator. 
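The three steps above can be sketched on a discretized universe. This is a minimal sketch, assuming A and A' are sampled on the same grid of X and B on a grid of Y; the function names and the sample grades are hypothetical. It evaluates both the composed form (2.43) and the clipped form (2.44), which should agree.

```python
from typing import List


def firing_strength(mu_a_prime: List[float], mu_a: List[float]) -> float:
    """max over x of min(mu_A'(x), mu_A(x)) on a shared sampling of X."""
    return max(min(ap, a) for ap, a in zip(mu_a_prime, mu_a))


def infer_composed(mu_a_prime: List[float], mu_a: List[float],
                   mu_b: List[float]) -> List[float]:
    """Form (2.43): compose the fact with the "min" implication relation."""
    return [max(min(ap, min(a, b)) for ap, a in zip(mu_a_prime, mu_a))
            for b in mu_b]


def infer_clipped(mu_a_prime: List[float], mu_a: List[float],
                  mu_b: List[float]) -> List[float]:
    """Form (2.44): clip B by the rule firing strength."""
    w = firing_strength(mu_a_prime, mu_a)
    return [min(w, b) for b in mu_b]


# Hypothetical sampled grades (illustrative values only).
mu_a = [0.0, 0.4, 1.0, 0.4, 0.0]        # antecedent set A on X
mu_a_prime = [0.3, 1.0, 0.6, 0.1, 0.0]  # fact A' on X
mu_b = [0.0, 0.5, 1.0, 0.5]             # consequent set B on Y

b_prime = infer_clipped(mu_a_prime, mu_a, mu_b)
```

With these grades the firing strength is 0.6, so B' is B with every grade above 0.6 clipped to 0.6.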

To better understand the fuzzy reasoning process, we analyze a simple generic example
that is based on the following fuzzy if-then rule

if x is A then y is B (2.45)

In (2.45), A and B are fuzzy sets on the universes of discourse X and Y. Given the fact
“x is A'”, we want to evaluate rule (2.45) using the fuzzy reasoning process defined by equation
(2.43). This process is illustrated in Figure 2.7. First we compute the rule firing strength,
$\max_x(\min(\mu_{A'}(x), \mu_A(x)))$. Then we infer the fuzzy set B' as B clipped by the rule firing
strength.

[Figure 2.7: panels “premise part” and “consequent part”]


2.3  Fuzzy  Inference 

Fuzzy inference is a powerful modeling framework that can handle computing with
knowledge uncertainty and measurement imprecision effectively [2]. Fuzzy inference is
based on the concepts of fuzzy set theory, fuzzy if-then rules, and fuzzy reasoning. It performs
a non-linear mapping from an input space to an output space by deriving conclusions
from a set of fuzzy if-then rules and known facts [3]. Fuzzy inference has been successfully
applied to a wide range of problems, such as control [4-14], time series prediction [75], pattern
recognition [20], and more recently classifier fusion [23]. Mamdani [48] and Sugeno [49]
are the two most commonly used fuzzy inference systems.


2.3.1  Mamdani  Fuzzy  Inference  System 

A Mamdani fuzzy inference system is an effective computing framework [48,76] based
on fuzzy reasoning. This type of inference system can be fully defined by means of a
fuzzy rule base (FRB) composed of a union of if-then fuzzy rules.

For an input vector $\mathbf{x} = \{x_j \mid j = 1, \dots, D\}$, a typical Mamdani-style fuzzy rule, $\mathcal{R}^i$, has
the following form:

$$\mathcal{R}^i: \text{If } x_1 \text{ is } A_1^i \text{ and } x_2 \text{ is } A_2^i, \dots, \text{ and } x_D \text{ is } A_D^i, \text{ then } o^i \text{ is } C^i. \tag{2.46}$$

In (2.46), $\mathcal{R}^i$, $i = 1, 2, \dots, r$, is the ith fuzzy rule of the FRB, $A_j^i$ is a fuzzy set associated
with the jth input $x_j$, and $C^i$ is the fuzzy set describing the output of the ith rule. These
fuzzy sets consist of linguistic labels characterized by parameterized membership functions.
The FRB is the union of all rules, i.e.,

$$\mathrm{FRB} = \bigcup_{i=1}^{r} \mathcal{R}^i. \tag{2.47}$$

Figure 2.8 is a graphical representation of a two-rule Mamdani fuzzy inference system
and how it derives the output z when subject to a crisp input $\mathbf{x} = \{x_j \mid j = 1, \dots, D\}$. The
inference starts by fuzzification of $\mathbf{x}$. Fuzzification assigns a membership degree to each
input dimension in the rules' input fuzzy sets. As shown in Figure 2.8, $\mathbf{x}$ activates the
ith input fuzzy set of the jth rule with a degree of truth $w_{ij}$. Next, an implication process
is executed, resulting in the activation of the rules' outputs with different degrees. In this
example, we use a simple min operator, and the output of rule $\mathcal{R}^j$ will be activated by a
degree $w_j = \min_{k=1,\dots,D} w_{kj}$. Next, using a simple max operator, the 2 output fuzzy sets
are aggregated to generate one output fuzzy set. Finally, the output set is defuzzified (e.g.
using its centroid) to generate a final crisp output value.

[Figure: panels “premise part” and “consequent part”; output z (centroid of area)]

Figure 2.8: Illustration of Mamdani fuzzy inference with 2 rules and D inputs.
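The Mamdani pipeline described above (fuzzification, min firing strength, clipping, max aggregation, centroid defuzzification) can be sketched as follows. This is only an illustrative sketch: the Gaussian membership functions, the two hypothetical rules `low` and `high`, and the output grid are assumptions made for the example and are not taken from the text.

```python
import math
from typing import Callable, List, Sequence, Tuple

MF = Callable[[float], float]


def gaussian(c: float, s: float) -> MF:
    """Gaussian membership function with center c and width s."""
    return lambda v: math.exp(-((v - c) ** 2) / (2.0 * s ** 2))


def mamdani(x: Sequence[float],
            rules: List[Tuple[List[MF], MF]],
            z_grid: Sequence[float]) -> float:
    """Crisp output of a Mamdani system with min implication and max aggregation."""
    # Firing strength of each rule: min over the D input memberships.
    strengths = [min(mf(xj) for mf, xj in zip(in_mfs, x)) for in_mfs, _ in rules]
    # Clip each output set by its firing strength, then aggregate with max.
    agg = [max(min(w, out_mf(z)) for w, (_, out_mf) in zip(strengths, rules))
           for z in z_grid]
    total = sum(agg)
    if total == 0.0:
        raise ValueError("no rule fired on the sampled output grid")
    # Discrete centroid of the aggregated output set.
    return sum(z * m for z, m in zip(z_grid, agg)) / total


# Two hypothetical rules on D = 2 inputs, with output sets centered at 2.5 and 7.5.
low = ([gaussian(0.0, 1.0), gaussian(0.0, 1.0)], gaussian(2.5, 1.0))
high = ([gaussian(4.0, 1.0), gaussian(4.0, 1.0)], gaussian(7.5, 1.0))
z_grid = [0.1 * i for i in range(101)]  # output universe sampled on [0, 10]

z_mid = mamdani([2.0, 2.0], [low, high], z_grid)
```

An input halfway between the two rule centers fires both rules equally, so the centroid lands in the middle of the output universe; inputs closer to one rule pull the output toward that rule's output set.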

The  system  in  Figure  2.8  implements  a  nonlinear  mapping  from  its  input  space  to  an 
output  space.  Each  fuzzy  rule  describes  a  local  context  in  which  the  mapping  is  achieved. 
The  input  and  output  membership  functions  can  be  designed  by  leveraging  expert  knowledge 
(this  practice  is  common  in  control  problems),  or  can  be  learned  directly  from  the  data. 
Specifically, labeled training data can be used to learn the FRB and the parameters of
its membership functions. Typically, grid-based or clustering-based algorithms are used
to  partition  the  input  space  [70].  Each  cluster  will  be  represented  by  one  fuzzy  rule  that 
describes  a  local  context.  Input  membership  functions  will  be  generated  based  on  the 
statistics  of  the  input  features  within  each  context.  Output  membership  functions  can  be 
generated  by  considering  the  distributions  of  labels  within  each  context  [23] . 

2.3.2  Sugeno  Fuzzy  Inference  System 

The Sugeno fuzzy model [49] was the first attempt at learning fuzzy rules from the
training data. Similar to the Mamdani system, the Sugeno fuzzy inference system is defined
by means of a fuzzy rule base. However, unlike Mamdani rules, a Sugeno rule does not use
fuzzy sets to describe the consequent part. Instead, it uses a crisp function $f(\cdot)$ to compute
the output. A typical Sugeno rule is defined as follows

$$\mathcal{R}^i: \text{If } x_1 \text{ is } A_1^i \text{ and } x_2 \text{ is } A_2^i, \dots, \text{ and } x_D \text{ is } A_D^i, \text{ then } o^i = f(x_1, x_2, \dots, x_D), \tag{2.48}$$

where $\mathcal{R}^i$, $i = 1, 2, \dots, r$, is the ith Sugeno fuzzy rule, and $A_j^i$ is a fuzzy set associated with the
jth input $x_j$. Typically, $f(\cdot)$ is polynomial in the input variables $x_1, \dots, x_D$. In this case
(2.48) can be rewritten as:

$$\mathcal{R}^i: \text{If } x_1 \text{ is } A_1^i \text{ and } x_2 \text{ is } A_2^i, \dots, \text{ and } x_D \text{ is } A_D^i, \text{ then } o^i = \sum_{k=1}^{D} b_k^i \cdot x_k + b_0^i, \tag{2.49}$$

where $b_0^i, \dots, b_D^i$ are the polynomial coefficients. When the polynomial is first
order, the Sugeno fuzzy model is called first order; when the polynomial is zero order
(i.e., a constant), the model is called zero order.

The choice of a polynomial function makes the Sugeno method computationally efficient and
well suited to optimization and adaptive techniques. This made Sugeno style inference
very attractive in control problems, particularly for dynamic nonlinear systems [77].

Figure 2.9 illustrates the Sugeno fuzzy inference procedure with 2 rules. The premise
part is evaluated as in the Mamdani system. Every rule $\mathcal{R}^i$ is activated with a degree
$w_i$, its firing strength. The output of every rule is a crisp value ($o^1$ and $o^2$); the overall output
of the system is obtained by taking the weighted average of the rules' outputs.

[Figure: panels “premise part” and “consequent part”; output $o = \frac{w_1 o^1 + w_2 o^2}{w_1 + w_2}$]

Figure 2.9: Illustration of Sugeno fuzzy inference with 2 rules and D inputs.
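A first-order Sugeno system can be sketched in a few lines. The sketch assumes Gaussian input membership functions, a min operator for the premise (as in the Mamdani description), and a coefficient layout `[b0, b1, ..., bD]` with `b0` as the constant term; these choices and all names are illustrative assumptions.

```python
import math
from typing import Callable, List, Sequence, Tuple

MF = Callable[[float], float]


def gaussian(c: float, s: float) -> MF:
    """Gaussian membership function with center c and width s."""
    return lambda v: math.exp(-((v - c) ** 2) / (2.0 * s ** 2))


def sugeno(x: Sequence[float],
           rules: List[Tuple[List[MF], List[float]]]) -> float:
    """Weighted average of crisp rule outputs; coeffs are [b0, b1, ..., bD]."""
    num = 0.0
    den = 0.0
    for in_mfs, coeffs in rules:
        # Firing strength: min over the input memberships.
        w = min(mf(xj) for mf, xj in zip(in_mfs, x))
        # Crisp first-order consequent: b0 + sum_k b_k * x_k.
        o = coeffs[0] + sum(b * xj for b, xj in zip(coeffs[1:], x))
        num += w * o
        den += w
    if den == 0.0:
        raise ValueError("no rule fired")
    return num / den


# Two hypothetical rules on a single input (D = 1).
rules = [
    ([gaussian(0.0, 1.0)], [1.0, 2.0]),  # o = 1 + 2 x
    ([gaussian(2.0, 1.0)], [5.0, 0.0]),  # o = 5 (zero-order consequent)
]
out = sugeno([1.0], rules)
```

At x = 1 both rules fire with the same strength, so the output is the plain average of the two rule outputs (3 and 5).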


2.3.3  ANFIS:  Adaptive  Neuro-Fuzzy  Inference  System 


The Adaptive Neuro-Fuzzy Inference System (ANFIS) [50] is a universal approximator
that combines the learning and modeling power of neural networks and fuzzy logic
into an adaptive inference system. Neural networks deal with imprecise data through training,
while fuzzy logic can deal with the uncertainty of human cognition. ANFIS offers an alternative
to rule identification. The Mamdani and Sugeno fuzzy systems identify rules based
on intuition. ANFIS, in contrast, can jointly learn the optimal input space partition and
the optimal output parameters through optimization. ANFIS is a hybrid intelligent system
which implements a Sugeno fuzzy inference system and provides a systematic approach to
generate fuzzy rules from a given input-output dataset. Typically, ANFIS is structured as a
feedforward neural network that contains five layers. Figure 2.10 is a graphical representation
of an ANFIS system with two Sugeno style rules and 2 inputs, given by

$$\begin{aligned} \mathcal{R}^1 &: \text{If } x_1 \text{ is } M_1^1 \text{ and } x_2 \text{ is } M_2^1 \text{ then } f_1 = p_1 x_1 + q_1 x_2 + r_1, \\ \mathcal{R}^2 &: \text{If } x_1 \text{ is } M_1^2 \text{ and } x_2 \text{ is } M_2^2 \text{ then } f_2 = p_2 x_1 + q_2 x_2 + r_2, \end{aligned} \tag{2.50}$$

where $M_j^k$ is a fuzzy set associated with the jth input of rule k, and $\{p_k, q_k, r_k\}$ are the
consequent parameters of the kth fuzzy rule. Nodes in the same layer have similar functions.

We denote the output of the ith node in layer l as $O_{l,i}$.

Layer 1 is known as the fuzzification layer, and is adaptive. It calculates the degree to which
a given input satisfies a fuzzy set M. Every node evaluates the membership degree
of an input in the fuzzy set $M_j^k$ with membership function $\mu_{M_j^k}$. Generally, $\mu_{M_j^k}$ is a
parameterized membership function (MF), for example a Gaussian MF, where

$$\mu_{M_j^k}(x) = \exp\left(-\frac{(x - c_{kj})^2}{2\sigma_{kj}^2}\right). \tag{2.51}$$

In (2.51), $c_{kj}$ and $\sigma_{kj}$ are the mean and standard deviation (width) of the Gaussian function, and are
referred to as the premise parameters.

Layer 2 is a fixed layer where every node computes the firing strength of a rule. The
output is the product of all incoming inputs.

$$O_{2,i} = w_i = \prod_{j=1}^{2} \mu_{M_j^i}(x_j). \tag{2.52}$$

Layer 3 is called “normalized firing strength”. It calculates the ratio of a rule's firing
strength to the sum of all rules' firing strengths.

$$O_{3,i} = \bar{w}_i = \frac{w_i}{\sum_{k=1}^{r} w_k}, \tag{2.53}$$

where r is the number of rules.

Layer 4 is an adaptive layer. It calculates each rule's output according to (2.50).

$$O_{4,i} = \bar{w}_i f_i = \bar{w}_i (p_i x_1 + q_i x_2 + r_i). \tag{2.54}$$

Layer 5 is a fixed layer. It computes the overall output, which is the summation of all
incoming signals.

$$O_{5,1} = \sum_i \bar{w}_i f_i = \sum_i \bar{w}_i (p_i x_1 + q_i x_2 + r_i). \tag{2.55}$$
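The five layers can be traced in a short forward pass. This is a minimal sketch for the two-input case of (2.50), assuming Gaussian membership functions as in (2.51); the function name `anfis_forward` and the parameter values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def anfis_forward(x: Sequence[float],
                  centers: List[List[float]],
                  sigmas: List[List[float]],
                  consequents: List[Tuple[float, float, float]]) -> float:
    """Forward pass of a two-input ANFIS; one row of centers/sigmas per rule."""
    # Layer 1: Gaussian membership of each input in each rule's fuzzy set.
    mu = [[math.exp(-((xj - c) ** 2) / (2.0 * s ** 2))
           for xj, c, s in zip(x, cs, ss)]
          for cs, ss in zip(centers, sigmas)]
    # Layer 2: firing strength of each rule (product of its memberships).
    w = [math.prod(row) for row in mu]
    # Layer 3: normalized firing strengths.
    total = sum(w)
    w_bar = [wi / total for wi in w]
    # Layer 4: normalized strength times the linear consequent p*x1 + q*x2 + r.
    o4 = [wb * (p * x[0] + q * x[1] + r)
          for wb, (p, q, r) in zip(w_bar, consequents)]
    # Layer 5: overall output is the sum of the incoming signals.
    return sum(o4)


# Hypothetical parameters for two rules.
centers = [[0.0, 0.0], [2.0, 2.0]]
sigmas = [[1.0, 1.0], [1.0, 1.0]]
consequents = [(1.0, 1.0, 0.0), (2.0, 0.0, 1.0)]

out = anfis_forward([1.0, 1.0], centers, sigmas, consequents)
```

At the input (1, 1) both rules fire equally, so the output is the average of $f_1 = 2$ and $f_2 = 3$.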

In the following, we assume that we have N D-dimensional training observations
$\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ with desired output $T = \{t_j \mid j = 1, \dots, N\}$. ANFIS is devised to learn its
parameters from training data. This process typically involves two steps: 1) a structure
identification and initialization step, and 2) a parameter optimization step. These two
steps are described below.

1. Model structure identification and initialization: This step involves finding
an optimal partition of the input space and initializing the fuzzy if-then rules. This
task can be achieved using an input space partitioning method as in the Mamdani FIS.
However, unlike Mamdani inference, ANFIS optimizes the parameters of the fuzzy
sets $M_j^k$. Thus, we need to use parameterized membership functions that are differentiable.
Typically, a Gaussian membership function is used. This MF can be completely
determined by two scalar parameters (center c and width $\sigma$):

$$\mu_{M_j^k}(x) = \exp\left(-\frac{(x - c_{kj})^2}{2\sigma_{kj}^2}\right). \tag{2.56}$$

Thus, identifying the structure of the ANFIS network is equivalent to:

(a) Partitioning the N D-dimensional training data into r clusters (i.e. rules). For
this step, standard clustering algorithms such as the FCM algorithm [60] can be
used.

(b) Initializing the premise parameters:

$$\mathcal{P} = \{c_{ij}, \sigma_{ij} \mid i = 1, 2, \dots, r;\ j = 1, \dots, D\}$$

Typically, $\mathbf{c}_i$ is set to the center of the ith cluster, and

d  =  vx  E  (2-57) 

fc=l,  »  6  fc 

In (2.57), $x_{kj}$ is the jth component of the kth observation, and $u_{ik}$ indicates the membership
of observation $\mathbf{x}_k$ in cluster i.

(c) Initializing the consequent parameters:

$$\mathcal{C} = \{p_i, q_i, r_i \mid i = 1, 2, \dots, r\}$$

Typically, a least squares estimator is used to initialize $\mathcal{C}$ as follows:

$$\mathcal{C} = (X^T X)^{-1} X^T T, \tag{2.58}$$

where X is the $N \times (D + 1)$ matrix of input training data right-padded with a
column vector of all 1's, T is a column vector of the desired outputs, and $X^T$ is the
matrix transpose of X.

2. Parameter Optimization: Once the structure of the network is defined and initialized,
an optimization and fine-tuning step of the system parameters is executed
using the hybrid learning rule [50], which is based on alternating optimization, to learn the
optimal premise and consequent parameters. During the network forward pass, premise
parameters are fixed and consequent parameters are updated using a least squares
estimator (LSE). Then, the consequent parameters are fixed and gradient descent is
used during back-propagation to optimize the premise parameters. These two steps
are alternated until the network converges to a target training error or a maximum
number of epochs is reached. A detailed description of the two main steps of the
hybrid learning is provided below.

• BackPropagation Learning Rule: In order to determine the update rule for
premise parameters, first, for the pth training pattern, we compute a squared
error measure commonly used in the backpropagation algorithm and defined as

$$E_p = (t_p - O_p)^2, \tag{2.59}$$

where $t_p$ is the desired output, and $O_p$ is the computed output of the network
when presented with training sample p. Before we continue with the derivation,
we draw the reader's attention to the fact that during the backward pass, the
consequent parameters are fixed and only the premise parameters are subject
to optimization.

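The overall error measure of the network is given by

$$E = \sum_{p=1}^{N} E_p. \tag{2.60}$$

To develop the gradient descent optimization on E, we compute the error rate
for the pth training pattern and for each node output $O_{l,i}$. This error rate $\epsilon_{l,i}$ (where $1 \le l \le 5$
indicates an ANFIS layer) is defined as follows

$$\epsilon_{l,i} = \frac{\partial E_p}{\partial O_{l,i}}. \tag{2.61}$$

The error rate at the output node is given as follows

$$\epsilon_{5,1} = \frac{\partial E_p}{\partial O_{5,1}} = \frac{\partial E_p}{\partial O_p} = -2(t_p - O_p). \tag{2.62}$$

For non-output nodes (i.e. internal nodes, $l < 5$), we use the chain rule to derive
the error rate

$$\epsilon_{l,i} = \frac{\partial E_p}{\partial O_{l,i}} = \sum_{h=1}^{\mathrm{Card}(l+1)} \frac{\partial E_p}{\partial O_{l+1,h}} \frac{\partial O_{l+1,h}}{\partial O_{l,i}}, \tag{2.63}$$

where $\mathrm{Card}(l+1)$ refers to the number of nodes at layer $l + 1$.

Next, we need to minimize the network error with respect to the premise parameters
$\{c_{kj}, \sigma_{kj} \mid 1 \le k \le r,\ 1 \le j \le D\}$. First, we compute the error rate with
respect to a generic parameter $\theta$ using

$$\frac{\partial E_p}{\partial \theta} = \sum_{O^* \in S} \frac{\partial E_p}{\partial O^*} \frac{\partial O^*}{\partial \theta}, \tag{2.64}$$

where S is the set of nodes whose outputs depend on $\theta$.
Given (2.60), we have

$$\frac{\partial E}{\partial \theta} = \sum_{p=1}^{N} \frac{\partial E_p}{\partial \theta}. \tag{2.65}$$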
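The error rate for the premise parameters $c_{kj}$ and $\sigma_{kj}$ can be computed using

$$\frac{\partial E_p}{\partial c_{kj}} = \frac{\partial E_p}{\partial O_5} \frac{\partial O_5}{\partial O_4} \frac{\partial O_4}{\partial O_3} \frac{\partial O_3}{\partial O_2} \frac{\partial O_2}{\partial O_1} \frac{\partial O_1}{\partial c_{kj}} \tag{2.66}$$

and,

$$\frac{\partial E_p}{\partial \sigma_{kj}} = \frac{\partial E_p}{\partial O_5} \frac{\partial O_5}{\partial O_4} \frac{\partial O_4}{\partial O_3} \frac{\partial O_3}{\partial O_2} \frac{\partial O_2}{\partial O_1} \frac{\partial O_1}{\partial \sigma_{kj}}. \tag{2.67}$$

From (2.62), we have

$$\frac{\partial E_p}{\partial O_5} = -2(t_p - O_p). \tag{2.68}$$

It is also straightforward to show that

$$\frac{\partial O_5}{\partial O_4} = \frac{\partial \left(\sum_{i=1}^{r} O_{4,i}\right)}{\partial O_4} = 1 \tag{2.69}$$

and,

$$\frac{\partial O_4}{\partial O_3} = \frac{\partial (\bar{w}_i f_i)}{\partial \bar{w}_i} = f_i = p_i x_1 + q_i x_2 + r_i. \tag{2.70}$$

Continuing the derivation, we have

$$\frac{\partial O_3}{\partial O_2} = \frac{\partial}{\partial w_k}\left(\frac{w_k}{\sum_{i=1}^{r} w_i}\right) = \frac{\sum_{i=1}^{r} w_i - w_k}{\left(\sum_{i=1}^{r} w_i\right)^2}. \tag{2.71}$$

Next we compute the derivative from layer 2 to layer 1

$$\frac{\partial O_2}{\partial O_1} = \frac{\partial \left(\prod_{d=1}^{D} \mu_{M_d^k}(x_{p,d})\right)}{\partial \mu_{M_j^k}(x_{p,j})} = \prod_{d=1,\, d \ne j}^{D} \mu_{M_d^k}(x_{p,d}). \tag{2.72}$$

Finally, we have

$$\frac{\partial O_1}{\partial c_{kj}} = \frac{(x_{p,j} - c_{kj})}{\sigma_{kj}^2} \times \exp\left(-\frac{(x_{p,j} - c_{kj})^2}{2\sigma_{kj}^2}\right) \tag{2.73}$$

and,

$$\frac{\partial O_1}{\partial \sigma_{kj}} = \frac{(x_{p,j} - c_{kj})^2}{\sigma_{kj}^3} \times \exp\left(-\frac{(x_{p,j} - c_{kj})^2}{2\sigma_{kj}^2}\right). \tag{2.74}$$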


Thus, the update equations for the parameters $c_{kj}$ and $\sigma_{kj}$ are given by

$$\Delta c_{kj} = -\eta \frac{\partial E}{\partial c_{kj}} \tag{2.75}$$

and,

$$\Delta \sigma_{kj} = -\eta \frac{\partial E}{\partial \sigma_{kj}}, \tag{2.76}$$

where $\eta$ is a learning rate determined in a similar manner to that of the standard
backpropagation algorithm [50].

Equations (2.75) and (2.76) can be used to update the $c_{kj}$ and $\sigma_{kj}$ parameters either
on-line, or in a batch mode. Next we develop the update rules for the consequent
parameters.

• LSE: The Least Squares Estimator (LSE) is used to minimize the squared error
$\|AB - T\|^2$, where A holds the outputs of Layer 3, and B holds the set of consequent
parameters subject to optimization. Initially, the parameters are identified using
(2.58). Then, in the subsequent forward passes the consequent parameters are
obtained using the pseudo-inverse of A, i.e.,

$$B = (A^T A)^{-1} A^T T. \tag{2.77}$$

In this type of LSE problem, it may happen that $(A^T A)$ is a singular matrix.
To overcome this problem, a recursive version of LSE can be used [70].

The derived update equations are used in an iterative algorithm that involves successive
updates of the premise and consequent parameters. The ANFIS learning
algorithm is summarized in Algorithm 2.3.
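The least squares step in (2.58) and (2.77) has the same normal-equation form. The sketch below works it out in closed form for the smallest case, a single input (D = 1) padded with a column of ones, so that $X^T X$ is a 2x2 matrix. The function name `lse_fit_1d` and the data are hypothetical, and a general implementation would solve a larger linear system (or use the recursive LSE mentioned above) instead.

```python
from typing import List, Tuple


def lse_fit_1d(xs: List[float], ts: List[float]) -> Tuple[float, float]:
    """Solve (X^T X) c = X^T T for X = [x, 1]; returns (slope, bias)."""
    n = float(len(xs))
    sxx = sum(x * x for x in xs)
    sx = sum(xs)
    sxt = sum(x * t for x, t in zip(xs, ts))
    st = sum(ts)
    # X^T X = [[sxx, sx], [sx, n]] and X^T T = [sxt, st].
    det = sxx * n - sx * sx
    if abs(det) < 1e-12:
        # The singular case noted in the text: the normal equations cannot be inverted.
        raise ValueError("X^T X is singular")
    slope = (n * sxt - sx * st) / det
    bias = (sxx * st - sx * sxt) / det
    return slope, bias


# Hypothetical noise-free data generated by t = 2 x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, 3.0, 5.0, 7.0]
slope, bias = lse_fit_1d(xs, ts)
```

When every input value is identical, $X^T X$ loses rank and the function raises an error, which mirrors the singularity caveat stated for (2.77).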
Algorithm 2.3 ANFIS Basic Learning Algorithm

Inputs:  X: the set of training patterns.
         T: the set of training labels.
         $\eta$: the learning rate.
         e: number of epochs.

Outputs: $b_{ij}$: the set of consequent parameters.
         $c_{ij}$: the set of membership functions' centers (premise parameters).
         $\sigma_{ij}$: the set of membership functions' widths (premise parameters).

Initialize $b_{ij}$ using (2.58).
Initialize $c_{ij}$ using FCM.
Initialize $\sigma_{ij}$ using (2.57).
repeat
    Update $b_{ij}$ using (2.77).
    Update $c_{ij}$ using (2.75).
    Update $\sigma_{ij}$ using (2.76).
until parameters do not change significantly or the number of epochs is exceeded
return $b_{ij}$, $c_{ij}$, $\sigma_{ij}$
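The Layer 1 derivatives (2.73) and (2.74) that drive the premise updates can be checked numerically. The sketch below compares the analytic derivatives of the Gaussian MF with central finite differences and then applies one gradient step in the form of (2.75); the function names, the test point, and the learning rate are assumptions chosen for illustration.

```python
import math
from typing import Callable


def gaussian_mf(x: float, c: float, s: float) -> float:
    """Gaussian membership function with center c and width s."""
    return math.exp(-((x - c) ** 2) / (2.0 * s ** 2))


def d_mf_dc(x: float, c: float, s: float) -> float:
    """Analytic derivative with respect to the center, as in (2.73)."""
    return (x - c) / s ** 2 * gaussian_mf(x, c, s)


def d_mf_ds(x: float, c: float, s: float) -> float:
    """Analytic derivative with respect to the width, as in (2.74)."""
    return (x - c) ** 2 / s ** 3 * gaussian_mf(x, c, s)


def central_diff(f: Callable[[float], float], v: float, h: float = 1e-6) -> float:
    """Central finite-difference approximation of f'(v)."""
    return (f(v + h) - f(v - h)) / (2.0 * h)


def gradient_step(param: float, grad: float, eta: float) -> float:
    """One update of the form param + delta, with delta = -eta * grad."""
    return param - eta * grad


x, c, s = 1.3, 0.5, 0.8  # hypothetical input, center, and width
num_dc = central_diff(lambda v: gaussian_mf(x, v, s), c)
num_ds = central_diff(lambda v: gaussian_mf(x, c, v), s)
```

Agreement between the analytic and numeric values is a cheap sanity check before wiring the derivatives into the full chain (2.66) and (2.67).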
CHAPTER 3


MULTIPLE  INSTANCE  FUZZY  LOGIC 

In this chapter, we formalize Multiple Instance Fuzzy Logic (MIFL). MIFL is different
from traditional fuzzy logic in that it allows for an additional dimension of ambiguity
and it enables fuzzy reasoning with bags of instances instead of a single instance at a time.
We introduce multiple instance variations of fuzzy propositions, fuzzy if-then rules, and
fuzzy reasoning, which are the building blocks of our proposed framework. The following
formulation is inspired by the work of Jang et al. [70] on traditional¹ fuzzy logic.

¹In the remainder of this proposal, previously presented fuzzy logic and fuzzy inference material will be
referred to as “traditional”.

3.1 Multiple Instance Fuzzy Propositions

Recall that in traditional fuzzy logic, a fuzzy proposition can be written as

p : X is A (3.1)

where X receives values x from a universal set U and A is a fuzzy set on U. For example,
a proposition can be “temperature is high”. In traditional fuzzy logic, to evaluate the
proposition p in (3.1), X is assigned a single value, say “temperature = 90”; this will lead
to “p : temperature = 90 is high”. This will work in most cases even if X is a vector in $\mathbb{R}^n$.
In fact, proposition (3.1) is valid as long as X is expressed by a single instance. However,
for multiple instance (MI) data, the universe of discourse consists of bags of instances rather
than single instances and the proposition needs to be generalized to a set of instances. Let
$B_i$ be a bag of $M_i$ instances. The jth instance, $\mathbf{x}_{ij}$, is a D-dimensional vector with elements
$x_{ijk}$ corresponding to features, i.e.,

$$B_i = \begin{bmatrix} x_{i11} & x_{i12} & \cdots & x_{i1D} \\ x_{i21} & x_{i22} & \cdots & x_{i2D} \\ \vdots & \vdots & \ddots & \vdots \\ x_{iM_i1} & x_{iM_i2} & \cdots & x_{iM_iD} \end{bmatrix} \tag{3.2}$$


Note that the number of instances can vary between bags ($M_i$ depends on $B_i$). A bag is
labeled positive if at least one of its instances is positive, and negative if all of its instances
are negative.

Definition 3.1.1. Let $\mathcal{B} = \{B_i \mid i = 1, \dots\}$ be the set of all bags. The universe of
discourse U is the set of all bags of a given problem ($U = \mathcal{B}$). For a given instance of a
given bag $B_i$, we define a “proposition instance” as:

$$p_j : \mathbf{x}_{ij} \text{ is } A. \tag{3.3}$$

Definition 3.1.2. We define a multiple instance fuzzy proposition as the disjunction of
proposition instances, i.e.,

$$q : B_i \text{ is } A \iff q : \bigvee_{j=1}^{M_i} p_j = \bigvee_{j=1}^{M_i} (\mathbf{x}_{ij} \text{ is } A) \tag{3.4}$$

In (3.4), “$\bigvee$” is a T-conorm (maximum, algebraic sum, bounded sum, etc.), as defined
in [78].

The proposition instance (“$\mathbf{x}_{ij}$ is A”) in Definition 3.1.1 is evaluated as in (2.36),
and represents the degree of truth of the proposition on a single instance. Not only does the bag
have different forms of expression (or instances), the proposition itself has different instances
of truth. It follows that the degree of truth of a multiple instance fuzzy proposition is a
combination of degrees of truth associated with the proposition instances. Equation (3.4) is analogous
to fuzzy information fusion [79,80]. Fuzzy information fusion deals with merging uncertain
observations that are possibly generated by heterogeneous sources. Thus, it is possible to
view the combination of degrees of truth of multiple instances as a fuzzy information
fusion process. In the following, we formalize our truth instances fusion process.

Let $\mu_A(B_i)$ denote the degree of truth of a multiple instance fuzzy proposition. $\mu_A(B_i)$
indicates the “membership degree” of $B_i$ in A. The expression in (3.4) can be simplified
further using the following theorem.


Theorem 3.1.3. Let B be a collection of M instances drawn from an instance space X,
and let A be a fuzzy set on X. The multiple instance proposition “q: B is A” is equivalent
to the following

$$q : B \text{ is } A \iff \exists\, x \in X \mid \mu_A(B) = \mu_A(x) \tag{3.5}$$

Note that x is not necessarily an instance of B.

Proof. From (3.4), we have $\mu_A(B) = \bigvee_{j=1}^{M} \mu_A(x_j)$, and we know that the T-conorm “$\bigvee$”,
an aggregation operator, is closed in [0, 1]. Thus, the aggregation of a given set of membership
grades remains in [0, 1]. It follows that $\mu_A(B) = \bigvee_{j=1}^{M} \mu_A(x_j) \in [0, 1]$. Assuming that the
fuzzy set A is normal and its membership function $\mu_A(x)$ is continuous, there exists $x \in X$
such that

$$\bigvee_{j=1}^{M} \mu_A(x_j) = \mu_A(x) \tag{3.6}$$

Hence,

$$\bigvee_{j=1}^{M} \mu_A(x_j) = \mu_A(x) = \mu_A(B) \tag{3.7}$$

□
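Theorem 3.1.3 can be illustrated numerically for one concrete choice of A. The sketch below assumes a Gaussian membership function (normal and continuous) and the algebraic sum as T-conorm, and recovers a point x whose membership equals the aggregated grade of the bag by inverting the Gaussian. The function `equivalent_instance` and the bag values are hypothetical; note that the inversion requires a strictly positive aggregated grade.

```python
import math
from functools import reduce


def gaussian_mf(x: float, c: float, s: float) -> float:
    """Gaussian membership function with center c and width s."""
    return math.exp(-((x - c) ** 2) / (2.0 * s ** 2))


def algebraic_sum(a: float, b: float) -> float:
    """Algebraic-sum T-conorm."""
    return a + b - a * b


def equivalent_instance(v: float, c: float, s: float) -> float:
    """Return x >= c with gaussian_mf(x, c, s) == v, for 0 < v <= 1."""
    if not 0.0 < v <= 1.0:
        raise ValueError("v must lie in (0, 1]")
    return c + s * math.sqrt(-2.0 * math.log(v))


c, s = 0.0, 1.0                       # hypothetical parameters of A
bag = [0.5, 1.5, 2.0]                 # a bag of three scalar instances
grades = [gaussian_mf(x, c, s) for x in bag]
mu_bag = reduce(algebraic_sum, grades)      # aggregated grade of the bag
x_eq = equivalent_instance(mu_bag, c, s)    # a point with the same membership
```

Here `x_eq` is not one of the three instances in the bag, which matches the remark that x is not necessarily an instance of B.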


If the T-conorm is carried out using a max operator, then $\mu_A(B_i)$ reduces to

$$\mu_A(B_i) = \max\{\mu_A(\mathbf{x}_{ij}),\ j = 1, \dots, M_i\} \tag{3.8}$$

In (3.8), $\mu_A(B_i)$ is the highest degree of truth associated with the proposition's instances.
This formulation is in line with the standard MIL assumption [36,39], which states that a
bag is positive if and only if one or more of its instances are positive. This relation will be
covered in more detail when we introduce multiple instance fuzzy inference in chapter 4.
Consider the example of the image classification task described in Section 1.1.
In this case, $B_i$ is the image shown in Figure 1.1, and the instances are the 12 patches (marked
by black squares). An example of a multiple instance proposition can be

$$q : \text{image is blue} \iff q : \bigvee_{j=1}^{12} (\text{patch}_j \text{ is blue}). \tag{3.9}$$
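The degree of truth of a multiple instance proposition can be sketched as a fold of a T-conorm over the instance grades, following (3.4) and (3.8). The three T-conorms below are the ones named in the text; the function `bag_membership` and the sample grades (standing in for per-patch values of “patch is blue”) are hypothetical.

```python
from functools import reduce
from typing import Callable, Sequence


def s_max(a: float, b: float) -> float:
    """Maximum T-conorm."""
    return max(a, b)


def s_algebraic_sum(a: float, b: float) -> float:
    """Algebraic-sum T-conorm."""
    return a + b - a * b


def s_bounded_sum(a: float, b: float) -> float:
    """Bounded-sum T-conorm."""
    return min(1.0, a + b)


def bag_membership(instance_grades: Sequence[float],
                   t_conorm: Callable[[float, float], float] = s_max) -> float:
    """Fold the T-conorm over the grades mu_A(x_ij), j = 1..M_i."""
    return reduce(t_conorm, instance_grades)


# Hypothetical grades of "patch is blue" for a bag of three patches.
grades = [0.2, 0.7, 0.4]
mu_max = bag_membership(grades)                    # max T-conorm
mu_alg = bag_membership(grades, s_algebraic_sum)
mu_bnd = bag_membership(grades, s_bounded_sum)
```

For a bag with a single instance the fold returns that instance's grade unchanged, so the multiple instance proposition falls back to the traditional one.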

3.2  Multiple  Instance  Fuzzy  If-Then  Rules 

Recall  that  in  traditional  fuzzy  logic  a  fuzzy  if-then  rule  is  expressed  as 

if  x  is  A  then  y  is  B  (3.10) 

where A and B are fuzzy sets on universes of discourse X and Y, respectively. As presented
in chapter 2, rule (3.10) combines the fuzzy propositions (p, q) into a logical implication
abbreviated as $A \to B$ with membership function $\mu_{A \to B}(x, y)$. The rule in (3.10) is defined
using  a  premise  part  that  is  a  single  instance  traditional  fuzzy  proposition.  Thus,  it  is  not 
suitable  to  carry  implications  on  multiple  instance  problems.  In  the  following,  we  introduce 
our  approach  to  multiple  instance  implication  that  will  lead  to  the  development  of  multiple 
instance  fuzzy  if-then  rules  and  multiple  instance  fuzzy  reasoning. 

3.2.1  Multiple  Instance  Fuzzy  Implication 

Definition 3.2.1. Similarly to traditional fuzzy if-then rules, we define a multiple instance
fuzzy rule as:

$$\text{if } B_i \text{ is } A \text{ then } y \text{ is } C \iff \text{if } \bigvee_{j=1}^{M_i} (\mathbf{x}_{ij} \text{ is } A) \text{ then } y \text{ is } C \tag{3.11}$$

where A and C are fuzzy sets on the universes of discourse X and Y, respectively, $B_i$ is a
bag of instances, and $M_i$ is the number of instances.

The premise part of a multiple instance fuzzy rule (i.e. $\bigvee_{j=1}^{M_i}(\mathbf{x}_{ij}$ is A)) is a multiple
instance proposition, whereas the consequence part is a traditional proposition. An example
rule is given as follows,

$$\text{if image is blue then class is sky} \iff \text{if } \bigvee_{j=1}^{12} (\text{patch}_j \text{ is blue}) \text{ then class is sky} \tag{3.12}$$
As before, the multiple instance rule combines the premise and consequence parts into a
logical implication. However, since the premise part is a multiple instance proposition we
will refer to this new logical implication as multiple instance implication. It is a fuzzy
relation on the product space $\mathcal{B} \times Y$ ($\mathcal{B}$: bags' space). Formally,

$$R = A \to C = A \times C = \int_{\mathcal{B} \times Y} \mu_A(B_i) \star \mu_C(y) / (B_i, y) \tag{3.13}$$

where $\star$ is a T-norm and $A \times C$ is used to represent the fuzzy relation R.

Lemma 3.2.2. There exists a transformation that transforms a multiple instance fuzzy
implication to a traditional fuzzy implication.

Proof. Using Theorem 3.1.3, we replace $\mu_A(B_i)$ by $\mu_A(x)$ and rewrite (3.13) as

$$R = A \to C = A \times C = \int_{X \times Y} \mu_A(x) \star \mu_C(y) / (x, y) \tag{3.14}$$

which is the expression of a traditional fuzzy relation. □

Thus, multiple instance fuzzy implication can be carried out using traditional fuzzy
implication. This result will be used when we develop multiple instance fuzzy reasoning in
the next section.

In (3.13), R has a membership function denoted $\mu_{A \to C}(B_i, y)$ that represents the degree of
truth of the implication when B is equal to $B_i$ and Y is equal to y. Using min and product
as implication operators, we have:

$$\mu_{A \to C}(B_i, y) = \int_{\mathcal{B} \times Y} \mu_A(B_i) \wedge \mu_C(y) / (B_i, y) = \min[\mu_A(B_i), \mu_C(y)] \tag{3.15}$$

and,

$$\mu_{A \to C}(B_i, y) = \int_{\mathcal{B} \times Y} \mu_A(B_i) \cdot \mu_C(y) / (B_i, y) = \mu_A(B_i) \cdot \mu_C(y) \tag{3.16}$$

3.3  Multiple  Instance  Fuzzy  Reasoning 

Multiple instance fuzzy reasoning is needed when the universe of discourse U is a
“bag-space” (i.e. $U = \mathcal{B}$), i.e., every element is a bag of instances rather than a single
instance. In this case, we define the Generalized Modus Ponens as

premise  if $B_i$ is A then y is C $\iff$ if $\bigvee_{j=1}^{M_i} (\mathbf{x}_{ij}$ is A) then y is C

fact  $B_i = \{\mathbf{x}_{i1}, \dots, \mathbf{x}_{iM_i}\}$ and $\mathbf{x}_{i1}$ is $A'_1$, $\mathbf{x}_{i2}$ is $A'_2$, ..., $\mathbf{x}_{iM_i}$ is $A'_{M_i}$

consequence  y is C'

A and $\{A'_j\}_{j=1}^{M_i}$ are fuzzy sets on X (the instances space), and C is a fuzzy set on Y. Using
the composition rule of inference, we determine C' using

$$C' = \left(\bigvee_{j=1}^{M_i} A'_j\right) \circ (A \to C) = \bigvee_{j=1}^{M_i} \left(A'_j \circ (A \to C)\right) \tag{3.17}$$

and we have,

$$\mu_{C'}(y) = \bigvee_{j=1}^{M_i} \left\{\max_x\left(\min[\mu_{A'_j}(x), \mu_{A \to C}(x, y)]\right)\right\} \tag{3.18}$$

Using min as implication operator, (3.18) is equivalent to

$$\mu_{C'}(y) = \bigvee_{j=1}^{M_i} \left\{\max_x\left(\min[\mu_{A'_j}(x), \min(\mu_A(x), \mu_C(y))]\right)\right\} \tag{3.19}$$

further simplification yields

$$\mu_{C'}(y) = \bigvee_{j=1}^{M_i} \left\{\min\left[\max_x\left(\min[\mu_{A'_j}(x), \mu_A(x)]\right), \mu_C(y)\right]\right\} \tag{3.20}$$

which is equivalent to

$$\mu_{C'}(y) = \min\left[\bigvee_{j=1}^{M_i} \left\{\max_x\left(\min[\mu_{A'_j}(x), \mu_A(x)]\right)\right\}, \mu_C(y)\right] \tag{3.21}$$

For instance, if the “max” aggregation operator is used, we have

$$\mu_{C'}(y) = \min\left[\max\left\{\max_x\left(\min[\mu_{A'_j}(x), \mu_A(x)]\right)\right\}_{j=1}^{M_i}, \mu_C(y)\right] \tag{3.22}$$

The term “$\max\{\max_x(\min[\mu_{A'_j}(x), \mu_A(x)])\}_{j=1}^{M_i}$” in (3.22) can be interpreted as the rule
firing strength [70].

To  summarize,  the  proposed  multiple  instance  fuzzy  reasoning  involves  the  following 
3  main  steps: 

1. Compute the multiple instance proposition degree of truth, i.e. evaluate
$\max\{\mu_A(\mathbf{x}_{ij}),\ j = 1, \dots, M_i\}$;

2.  Compute  the  rule  firing  strength,  or  the  degree  of  belief  for  the  antecedent  part; 


3.  Compute  the  degree  of  belief  of  the  consequent  part  by  applying  the  “min”  operator. 
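The three steps above can be sketched on a discretized instance space with the “max” aggregation operator. This is a minimal sketch, assuming every fact $A'_j$ and the antecedent A are sampled on the same grid of X; the function names and the grades are hypothetical. It evaluates both the composed form (3.19) and the clipped form (3.22), which should agree when max is the T-conorm.

```python
from typing import List


def firing_strength(mu_fact: List[float], mu_a: List[float]) -> float:
    """max over x of min(mu_A'_j(x), mu_A(x)) for one instance of the bag."""
    return max(min(f, a) for f, a in zip(mu_fact, mu_a))


def mi_infer_clipped(bag_facts: List[List[float]], mu_a: List[float],
                     mu_c: List[float]) -> List[float]:
    """Form (3.22): clip C by the largest per-instance firing strength."""
    w = max(firing_strength(fact, mu_a) for fact in bag_facts)
    return [min(w, c) for c in mu_c]


def mi_infer_composed(bag_facts: List[List[float]], mu_a: List[float],
                      mu_c: List[float]) -> List[float]:
    """Form (3.19) with max as the T-conorm over the instances."""
    return [max(max(min(f, min(a, c)) for f, a in zip(fact, mu_a))
                for fact in bag_facts)
            for c in mu_c]


# Hypothetical sampled grades (illustrative values only).
mu_a = [0.0, 0.5, 1.0, 0.5, 0.0]   # antecedent set A on X
bag_facts = [                      # one fuzzy fact A'_j per instance
    [1.0, 0.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.9, 0.3],
    [0.0, 0.3, 0.7, 0.3, 0.0],
]
mu_c = [0.2, 1.0, 0.6]             # consequent set C on Y

c_prime = mi_infer_clipped(bag_facts, mu_a, mu_c)
```

The per-instance firing strengths are 0.2, 0.5, and 0.7, so the rule fires at 0.7 and C is clipped at that level. With a single instance in the bag, the computation coincides with the traditional clipped inference.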
3.4 Illustrative Example

Let B be a bag of three instances $\mathbf{x}_1$, $\mathbf{x}_2$, and $\mathbf{x}_3$. Let $A'_1$, $A'_2$, $A'_3$ be the fuzzy sets
associated with the instances. Given this fact we want to evaluate the following multiple
instance rule

$$\text{if } B \text{ is } A \text{ then } y \text{ is } C \iff \text{if } \bigvee_{j=1}^{3} (\mathbf{x}_j \text{ is } A) \text{ then } y \text{ is } C \tag{3.23}$$

where A and C are fuzzy sets, defined as before. Figure 3.1 illustrates the proposed multiple
instance inference process. To compute the rule firing strength we need to evaluate

$$\mu_{C'}(y) = \min\left[\max\left\{\max_x\left(\min[\mu_{A'_j}(x), \mu_A(x)]\right)\right\}_{j=1}^{3}, \mu_C(y)\right] \tag{3.24}$$

First, we compute the truth instances (the shaded area in the premise part of Figure 3.1).

[Figure: panels “premise part” and “consequent part”]

Figure 3.1: Illustration of the multiple instance inference process using the
“max” aggregation operator. Legend: (1) = $\max_x(\min(\mu_{A'_1}(x), \mu_A(x)))$,
(2) = $\max_x(\min(\mu_{A'_2}(x), \mu_A(x)))$, (3) = $\max_x(\min(\mu_{A'_3}(x), \mu_A(x)))$,
(4) = $\max\{\max_x(\min[\mu_{A'_j}(x), \mu_A(x)])\}_{j=1}^{3}$, (5) = $\mu_{C'}(y)$.

Then all truth instances are aggregated using the “max” operator, i.e. we select the highest
truth instance as the rule firing strength. Finally the membership function (MF), $\mu_{C'}(y)$,
for the consequent part is computed as the MF of C clipped by the rule firing strength.


3.5  Discussion 

Equation (3.21) defines fuzzy reasoning with bags of instances. To reach this goal, we have proposed multiple instance variations of the fuzzy logic building blocks, i.e., propositions, if-then rules, implications, and Generalized Modus Ponens. Our generalization was derived using a thorough and abstract mathematical formulation. The new findings will be used to build more advanced and complex fuzzy inference systems, as will be shown in the remaining chapters. It is also worth noting that multiple instance fuzzy logic is a generalization of fuzzy logic: if we set the number of instances in each bag to 1, all presented approaches reduce to those of traditional fuzzy logic.
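As a quick check of this reduction, the following worked step starts from (3.21) and sets $M_i = 1$. It is a sketch that assumes the usual convention that a T-conorm applied to a single argument returns that argument.

```latex
\mu_{C'}(y)
  = \min\Big[ \bigvee_{j=1}^{1} \big\{ \max_x\big(\min[\mu_{A'_j}(x), \mu_A(x)]\big) \big\},\ \mu_C(y) \Big]
  = \min\Big[ \max_x\big(\min[\mu_{A'_1}(x), \mu_A(x)]\big),\ \mu_C(y) \Big]
```

The right-hand side is the familiar max-min form for a single-antecedent rule: the firing strength $\max_x \min[\mu_{A'_1}(x), \mu_A(x)]$ clips the consequent MF.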

The difference between our multiple instance framework and fuzzy logic may seem subtle, but we think there is an important contribution to point out. In his short abstract published in 2008, titled “Is there a need for fuzzy logic?” [81], Zadeh wrote: “Fuzzy logic is not fuzzy. Basically, fuzzy logic is a precise logic of imprecision and approximate reasoning”. We think that fuzzy logic is powerful at modeling knowledge uncertainty and measurement imprecision. More generally, it is one of the best frameworks to model vagueness. However, in addition to uncertainty and imprecision, there is a third vagueness concept that fuzzy logic does not yet address quite well. This vagueness concept is due to the ambiguity that arises when the data have multiple forms of expression, as is the case for multiple instance problems. Our framework deals with ambiguity by introducing the novel concept of truth instances: when reasoning with multiple instance fuzzy logic, a proposition does not have a single degree of truth; it has multiple degrees of truth, which we call truth instances. This effectively encodes the third vagueness component, ambiguity, and increases the expressive power of traditional fuzzy logic.

3.6  Related  Work 

Zadeh introduced fuzzy sets in 1965 [69] and fuzzy logic in 1973 [4]. After that, Mamdani and Sugeno followed with substantial additions [48,49,76]. Since then, many other developments and extensions to fuzzy theory have been proposed. Most of the contributions can be classified under three categories: 1) contributions that propose variations and generalizations of fuzzy sets, 2) contributions that develop new fuzzy logic frameworks, and 3) contributions that propose new fuzzy inference schemes. For instance, Yager introduced a new type of fuzzy sets known as fuzzy multisets (fuzzy bags) [82], Atanassov proposed intuitionistic fuzzy sets [83], and more recently Torra proposed hesitant fuzzy sets [84]. These approaches can be classified under the first category. Work that can be classified under the second category includes complex fuzzy logic [74,85,86] and complex fuzzy reasoning [87]. Under the third category, we can cite the contribution of Kaburlasos et al. [88], which consists of an extension of fuzzy inference systems based on lattice theory.

To the best of our knowledge, there have been no proposed variations that aimed at reformulating fuzzy logic to support reasoning with multiple instances at the same time. The only previous work that mentions both fuzzy methods and MIL in the same framework was presented by Mahnot et al. [89]. They used fuzzy operators to compute diverse density [39]. This is by no means a reformulation of fuzzy logic to solve the multiple instance problem. While there are no approaches directly related to our work, most methods have something in common, as they aim to extend fuzzy logic and broaden its domain of applicability. For instance, fuzzy multisets [82] may seem to be related to our approach because they utilize bags of elements to represent objects. A fuzzy multiset can be defined as a fuzzy set where multiple occurrences of an element are permitted. Within our framework, it can be used to represent the results of a bag's fuzzification, i.e., the membership degrees of each instance in a given fuzzy set. Also, aggregation operators proposed for fuzzy multisets [90] could be used in our proposed extension.

In fact, other proposed extensions of fuzzy sets could be adapted to the context of multiple instance fuzzy logic. For example, complex fuzzy sets [85] or complex fuzzy classes [91] are based on fuzzy sets characterized by complex-valued membership functions. Because of the two-dimensional nature of a complex fuzzy set, one can think of using it to carry out reasoning with bags containing at most two instances. This latter formulation is not necessarily obvious and is worth investigating in future research projects.
3.7  Chapter Summary


In  this  chapter,  we  have  introduced  a  new  approach  for  multiple  instance  fuzzy 
logic.  This  approach  extends  traditional  fuzzy  logic  to  enable  reasoning  with  bags  rather 
than  single  instances.  In  particular,  we  have  introduced  multiple  instance  variations  of 
fuzzy  propositions,  fuzzy  implication,  fuzzy  if-then  rules,  and  fuzzy  reasoning.  We  have 
also  discussed  the  novel  concept  of  truth  instances.  In  the  next  chapter,  we  will  use  the 
presented  building  blocks  to  derive  new  styles  of  fuzzy  inference  systems. 
CHAPTER 4


MULTIPLE  INSTANCE  FUZZY  INFERENCE 

In this chapter, we introduce our approach to perform fuzzy inference with multiple instances. More specifically, we introduce two multiple instance fuzzy inference styles. The first one, the Multiple Instance Mamdani style fuzzy inference (MI-Mamdani), extends the traditional Mamdani style inference to account for multiple instances. The second one, the Multiple Instance Sugeno style fuzzy inference (MI-Sugeno), extends the Sugeno style inference.

4.1  Multiple  Instance  Mamdani  Style  Fuzzy  Inference 

The traditional Mamdani inference system outlined in Chapter 2 is limited to reasoning with individual instances. First, the system's input is an individual instance. Second, the rules describe fuzzy regions within the instance space. Third, the output of the system corresponds to the fuzzy inference using the $D$ dimensions of a single instance. Fourth, labels of the individual instances are required to learn the parameters of the system.

In  MIL,  as  outlined  in  Section  2.1,  objects  are  described  by  bags  of  instances,  and  labels  are 
available  only  at  the  bag  level.  Thus,  the  standard  Mamdani  style  fuzzy  inference  system 
cannot  be  used  within  the  MIL  framework. 

In the following, we propose a generalization of Mamdani fuzzy inference that extends it to reason with bags of instances. Similar to the traditional Mamdani system, we formulate the proposed multiple instance Mamdani system (MI-Mamdani) by means of multiple instance fuzzy if-then rules that can evaluate bags. As introduced in Chapter 3, multiple instance fuzzy rules can be expressed using:

$$\mathcal{R}^i(B_p): \bigvee_{j=1}^{M_p} \big(\text{If } x_{pj1} \text{ is } A^i_1 \text{ and } x_{pj2} \text{ is } A^i_2, \dots, \text{ and } x_{pjD} \text{ is } A^i_D\big), \text{ then } o^i \text{ is } C^i. \tag{4.1}$$

where $B_p$ is a bag of $M_p$ instances as defined in (3.2). In (4.1), $A^i_k$ is a fuzzy set associated with the $k$th instance feature, and “$\vee$” is a T-conorm. The output of the rule is described by the fuzzy set $C^i$.

In multiple instance fuzzy reasoning, the antecedent part, $\bigvee_{j=1}^{M_p}(\text{if } x_{pj1} \text{ is } A^i_1, \dots, \text{ and } x_{pjD} \text{ is } A^i_D)$, evaluates the degree to which the antecedent fuzzy sets describe each instance separately; then all responses are combined into a rule firing strength using a T-conorm. Using this inference style, the rule will be fired if and only if there exists at least one instance in the bag that can be described by means of the antecedent fuzzy sets.

The reason behind using a T-conorm for combining the individual instances' responses goes back to the standard MIL assumption [36, 39], which states that each instance has a hidden class label and that, under this assumption, an example is positive if and only if one or more of its instances are positive. Thus, the bag-level class label is determined by the disjunction of the instance-level class labels. In the context of multiple instance inference, if a fuzzy rule describes a local region of the instance space that happens to be a positive MIL concept, and if the rule's output is high, the multiple instance fuzzy rule will be capable of classifying positive bags correctly. This is because at least one instance from each positive bag will activate the rule, leading to a high output (positive label). On the other hand, negative bags will not be able to significantly activate any rule.

Figure 4.1 illustrates the proposed MI-Mamdani system and its fuzzy inference mechanism to derive the output $z$ in response to a bag of instances for the simple case of two rules. As can be seen, the premise part of the rules evaluates all the bag's instances simultaneously. The inference starts with the fuzzification of the instances $x_{pm}$ of a given bag $B_p$. Fuzzification assigns a membership degree to each input instance dimension in the rules' input fuzzy sets. In Figure 4.1, instance $x_{pm}$ activates the $i$th input fuzzy set of the $j$th rule by a degree of truth $\mu_{mij}$. Next, an implication process is executed to combine the activations of the instances within the bag, resulting in the activation of the rules' outputs with different degrees. In this example, we use a simple min operator, and the output of rule $\mathcal{R}^j$ will be partially activated by a degree $w_{mj} = \min_{k=1,\dots,D} \mu_{mkj}$. The $w_{mj}$ (truth instances) are combined in the premise part using the max T-conorm, resulting in the activation of rule $\mathcal{R}^j$ by a degree $w_j = \max_{m=1,\dots,M} w_{mj}$. Next, using a simple max operator, the 2 output fuzzy sets are aggregated to generate one output fuzzy set. Finally, the output set is defuzzified (e.g., using its centroid) to generate a final crisp output value.

Figure 4.1: Illustration of the proposed multiple instance Mamdani fuzzy inference system. The premise part of each rule receives the $M$ instances $x_{pm} = [x_{pm1}, x_{pm2}, \dots, x_{pmD}]$ of the bag ($D$ features each); the consequent parts are aggregated and defuzzified into the output $z$ (centroid of area).
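The inference chain described above (fuzzification, min across features, max across instances, clipping, max aggregation, centroid defuzzification) can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: Gaussian input and output MFs, the rule parameters, the discretized output grid, and all function names are assumptions made for the example.

```python
import math


def gaussian(x, center, sigma):
    """Gaussian membership function (assumed MF shape for this sketch)."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))


def rule_firing_strength(bag, centers, sigmas):
    """w_j = max over instances of (min over features of the membership degrees)."""
    truth_instances = [
        min(gaussian(x, c, s) for x, c, s in zip(instance, centers, sigmas))
        for instance in bag
    ]
    return max(truth_instances)


def mi_mamdani(bag, rules, z_grid):
    """Clip each rule's output MF, aggregate with max, defuzzify by centroid."""
    aggregated = [0.0] * len(z_grid)
    for rule in rules:
        w = rule_firing_strength(bag, rule["centers"], rule["sigmas"])
        for n, z in enumerate(z_grid):
            clipped = min(w, gaussian(z, rule["out_center"], rule["out_sigma"]))
            aggregated[n] = max(aggregated[n], clipped)
    total = sum(aggregated)
    if total == 0.0:
        return 0.0
    return sum(z * m for z, m in zip(z_grid, aggregated)) / total


# Two hypothetical rules over a two-dimensional instance space.
rules = [
    {"centers": (0.5, 0.5), "sigmas": (0.1, 0.1), "out_center": 0.2, "out_sigma": 0.1},
    {"centers": (1.5, 1.5), "sigmas": (0.1, 0.1), "out_center": 0.8, "out_sigma": 0.1},
]
bag = [(0.0, 1.0), (1.5, 1.5)]  # one instance sits exactly on rule 2's premise
z_grid = [n / 100.0 for n in range(101)]
z = mi_mamdani(bag, rules, z_grid)
```

Because a single instance of the bag matches the second rule, that rule fires fully and the crisp output lands near the center of its output MF, regardless of where the other instance lies.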

The  MI-Mamdani  inference  system  allows  the  use  of  different  T-conorms  on  different  rules. 
The  choice  of  the  appropriate  function  should  depend  on  the  application  and  the  purpose 
of  the  rule.  More  specifically:  should  the  rule  be  activated  if  at  least  one  instance  of  the 
bag  is  within  the  target  concept?  Or  should  it  be  activated  only  if  at  least  a  fixed  subset  of 
the  instances  are  within  the  target  concept? 

Finally, we should note here that if we set $M_p$ to 1 (i.e., constrain all bags to include only one instance), (4.1) reduces to a traditional fuzzy if-then rule commonly used in Mamdani FIS. Thus, the proposed MI-Mamdani fuzzy inference system can be considered a generalization of the traditional Mamdani system.
4.2  Multiple Instance Sugeno Style Fuzzy Inference


The rule in (2.49) is a traditional fuzzy if-then rule. As we showed in Chapter 3, this type of rule is not suitable for solving multiple instance problems. To take advantage of the Sugeno inference and apply it to problems where objects are described by multiple instances, we propose the multiple instance Sugeno inference (MI-Sugeno) system.

Similar to the MI-Mamdani system introduced in Section 4.1, the MI-Sugeno system uses multiple instance fuzzy if-then rules where the consequent part is described by means of a function $C$ that maps a bag of instances to a crisp numerical value. Specifically, we define a multiple instance Sugeno rule as:

$$\mathcal{R}^i(B_p): \bigvee_{j=1}^{M_p} \big(\text{If } x_{pj1} \text{ is } A^i_1, \dots, \text{ and } x_{pjD} \text{ is } A^i_D\big), \text{ then } o^i = C(x_{p1} \cdot b^i, x_{p2} \cdot b^i, \dots, x_{pM_p} \cdot b^i) \tag{4.2}$$

In (4.2), $b^i = (b^i_0, \dots, b^i_D)$ is a set of polynomial coefficients. Similar to the traditional Sugeno fuzzy model, when the polynomial coefficients $b^i$ are first order, our MI-Sugeno fuzzy model is called first order, and zero order when the polynomial coefficients are zero order.

The premise part of the rule is evaluated as in the MI-Mamdani case. To evaluate the consequent part, first the linear response of each instance is computed, i.e., $x_{pj} \cdot b^i$. Then a function $C$ is used to compute the final output by combining the instances' responses. Many functions could be used, and the choice should be domain-specific. For instance, the “max” function has been used in many applications.

The consequent part of the proposed MI-Sugeno style inference system is inspired by the work of Ray and Page on multiple instance regression [61]. In their work, the authors proposed a regression framework for predicting bags' labels. This formulation allows the linear coefficients $b^i$ and the parameters of the combining function $C$ to be learned using optimization techniques.

Figure 4.2 illustrates the proposed MI-Sugeno system with 2 rules. The premise part of this system is equivalent to that of MI-Mamdani (Figure 4.1). Its task is to evaluate the firing strength of each multiple instance rule. In the consequent part, the outputs of the rules, $o^1$ and $o^2$, are crisp values obtained as outputs of the combining function $C$. As in the traditional Sugeno fuzzy inference system, the overall output of the system is obtained by taking the weighted average of the rules' outputs.

Figure 4.2: Illustration of the proposed multiple instance Sugeno fuzzy inference system. The premise part evaluates the instances $x_{pm} = [x_{pm1}, x_{pm2}, \dots, x_{pmD}]$ of the bag; the consequent part computes the rule outputs, e.g., $o^2 = C(x_{p1} \cdot b^2, x_{p2} \cdot b^2, \dots, x_{pM} \cdot b^2)$, which are combined by a weighted average.


Similar  to  traditional  fuzzy  inference,  the  premise  part  of  a  multiple  instance  rule 
defines  a  local  fuzzy  region  within  the  instance  space,  and  the  consequent  part  describes 
the  characteristics  of  the  system’s  output  within  that  region.  More  specifically,  in  multiple 
instance  learning  (MIL)  problems,  a  local  region  describes  a  positive  concept,  and  the  output 
of  a  rule  represents  the  degree  of  “positivity”  of  the  instances  in  that  local  region. 
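The MI-Sugeno inference described in this section can be sketched as follows. This is an illustrative sketch under several assumptions that are not fixed by the text: Gaussian premise MFs, the min T-norm across features, the max T-conorm across instances, “max” as the combining function $C$, and a linear response in which $b^i_0$ acts as a bias term. All names and parameter values are hypothetical.

```python
import math


def gaussian(x, center, sigma):
    """Gaussian membership function (assumed MF shape for this sketch)."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))


def firing_strength(bag, centers, sigmas):
    """Premise part: min over features, then max over the bag's instances."""
    return max(
        min(gaussian(x, c, s) for x, c, s in zip(inst, centers, sigmas))
        for inst in bag
    )


def linear_response(instance, b):
    """x_pj . b with b = (b_0, b_1, ..., b_D); b_0 is treated as a bias (assumption)."""
    return b[0] + sum(bk * xk for bk, xk in zip(b[1:], instance))


def mi_sugeno(bag, rules, combine=max):
    """Weighted average of the rules' crisp outputs o^i = C(x_p1 . b^i, ...)."""
    weights, outputs = [], []
    for rule in rules:
        weights.append(firing_strength(bag, rule["centers"], rule["sigmas"]))
        outputs.append(combine(linear_response(x, rule["b"]) for x in bag))
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * o for w, o in zip(weights, outputs)) / total


# Two hypothetical zero-order rules: rule 1 outputs 1, rule 2 outputs 0.
rules = [
    {"centers": (0.5, 0.5), "sigmas": (0.1, 0.1), "b": (1.0, 0.0, 0.0)},
    {"centers": (1.5, 1.5), "sigmas": (0.1, 0.1), "b": (0.0, 0.0, 0.0)},
]
bag = [(0.5, 0.5), (3.0, 3.0)]
output = mi_sugeno(bag, rules)
```

Here the first instance matches the premise of rule 1, so the weighted average is dominated by that rule's output even though the second instance lies far from both concepts.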


4.3  Learning the Structure and Parameters of Multiple Instance Fuzzy Inference Systems

The most important task in fuzzy inference with both MI-Mamdani and MI-Sugeno systems is the identification and learning of the system's structure and its parameters. Structure identification consists of identifying the number of multiple instance if-then fuzzy rules, the type of membership functions (MFs) of the premise and consequent parts (e.g., Gaussian or trapezoidal MFs), and the T-norms and T-conorms (min, max, product, ...) involved in the multiple instance fuzzy reasoning. Once the structure is fixed, the parameters of the membership functions need to be learned. For example, for a Gaussian MF we need to specify the mean and the standard deviation. In addition, in the case of an MI-Sugeno system we need to initialize the polynomial coefficients.

The identification of the system's structure and parameters relies mainly on determining the characteristics of the local regions within the instance space that characterize positive bags. In traditional (i.e., single instance representation) fuzzy modeling, this task is achieved through input space partitioning, typically using grid partitioning and clustering [70]. In multiple instance inference systems, we need to identify regions that are defined by positive instances, referred to as positive concepts. Since in MIL data is labeled at the bag level and not at the instance level, traditional space partitioning methods cannot be used to learn the multiple instance fuzzy rules.

In the following, we describe our proposed approach to identify multiple instance fuzzy rules based on a fuzzy clustering algorithm of multiple instance (FCMI) data [59]. FCMI identifies target concepts that correspond to dense regions in the instance space that include as many positive instances as possible and as few negative instances as possible. In particular, we define the premise parts of the MI-FIS rules as local contexts within the input space (the instance space) that coincide with the identified target concepts.

Assume that we have $N$ training bags, $\mathcal{B} = \{B_i \mid i = 1, \dots, N\}$, and the set of their corresponding labels, $T = \{t_i \mid i = 1, \dots, N\}$. Let $\mathcal{T} = \{C_1, \dots, C_r\}$ be $r$ target concept points. Each target concept $C_i$ is characterized by a center $c_i \in \mathbb{R}^D$ and a feature relevance scale vector $s_i \in \mathbb{R}^D$. The FCMI algorithm maximizes a fuzzy Multiple Concept Diverse Density (MDD) measure [59] defined as:

$$\mathrm{MDD}(\mathcal{T}, U) = \prod_{n=1}^{N} \prod_{i=1}^{r} \big[\Pr(C_i \mid B_n)\big]^{u_{in}^m}. \tag{4.3}$$

In (4.3), $U = [u_{in}]$ is a membership matrix such that each bag $B_n$ is assigned to target concept $C_i$ with membership degree $u_{in}$, and $m$ is a fuzzifier that controls the fuzziness of the partitions as in FCM [60]. $\Pr(C_i \mid B_n)$ is the probability that $C_i$ is a target concept given $B_n$, and is defined as

$$\Pr(C_i \mid B_n) = \begin{cases} 1 - \prod_{k=1}^{M_n} \big(1 - \Pr(x_{nk} \in C_i)\big) & \text{if } \mathrm{label}(B_n) = 1, \\ \prod_{k=1}^{M_n} \big(1 - \Pr(x_{nk} \in C_i)\big) & \text{if } \mathrm{label}(B_n) = 0 \end{cases} \tag{4.4}$$
where $\mathrm{label}(B_n)$ is the label of bag $B_n$ and $x_{nk}$ is the $k$th instance of bag $B_n$. $\Pr(x_{nk} \in C_i)$ is regarded as the similarity of instance $x_{nk}$ to target concept $C_i$, and it is computed using

$$\Pr(x_{nk} \in C_i) = \exp\Big(-\sum_{j=1}^{D} s_{ij} (x_{nkj} - c_{ij})^2\Big) \tag{4.5}$$

In (4.5), $s_{ij}$ is a scaling parameter that weighs the role of feature $j$ in target concept $i$ [39].
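To make the MDD measure concrete, the following sketch evaluates the instance similarity, the bag-level probability (a noisy-or over the bag's instances for positive bags), and the fuzzy MDD product for a given set of concepts. It only evaluates the objective; it does not perform the FCMI optimization. The function names, the data layout, and the default fuzzifier value `m=2.0` are assumptions made for the example.

```python
import math


def instance_similarity(x, center, scales):
    """Pr(x in C_i): exp(-sum_j s_ij * (x_j - c_ij)^2)."""
    return math.exp(-sum(s * (xj - cj) ** 2 for xj, cj, s in zip(x, center, scales)))


def concept_given_bag(bag, label, center, scales):
    """Pr(C_i | B_n): noisy-or for positive bags, its complement for negative bags."""
    none_in_concept = 1.0
    for x in bag:
        none_in_concept *= 1.0 - instance_similarity(x, center, scales)
    return 1.0 - none_in_concept if label == 1 else none_in_concept


def mdd(bags, labels, concepts, memberships, m=2.0):
    """Fuzzy MDD: product over bags n and concepts i of Pr(C_i | B_n) ** (u_in ** m).

    concepts[i] is a (center, scales) pair; memberships[i][n] is u_in.
    """
    value = 1.0
    for n, (bag, label) in enumerate(zip(bags, labels)):
        for i, (center, scales) in enumerate(concepts):
            p = concept_given_bag(bag, label, center, scales)
            value *= p ** (memberships[i][n] ** m)
    return value


# Hypothetical data: one positive bag touching the concept, one negative bag far away.
concept = ((1.0, 1.0), (1.0, 1.0))
positive_bag = [(1.0, 1.0), (5.0, 5.0)]
negative_bag = [(4.0, 4.0)]
score = mdd([positive_bag, negative_bag], [1, 0], [concept], [[1.0, 1.0]])
```

A concept scores highly when every positive bag has at least one instance near its center and no instance of a negative bag is near it, which is the behavior the measure is designed to reward.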
Let $\{C_i^{opt} = \{c_i^{opt}, s_i^{opt}\}\}_{i=1}^{r}$ be the optimal target concepts identified by FCMI, i.e., those that maximize (4.3). For simplicity, we assume that the MFs of the $r$ multiple instance rules are Gaussian MFs, with centers $c_{ij}$, $i = 1, \dots, r$, and $j = 1, \dots, D$. For a given multiple instance rule $\mathcal{R}^i$, the centers of the premise part's MFs are the centers of the target concepts, i.e.,

$$c_{ij} = c^{opt}_{ij}, \quad \text{for } j = 1, \dots, D. \tag{4.6}$$

The diverse density of each concept decreases gradually as we move away from $C_i$. Intuitively, the width $\sigma_{ij}$ of a given concept $C_i$ along dimension $j$ can be set to the radius beyond which MDD is lower than a diverse density threshold $\tau_i$. Formally, the standard deviations, $\{\sigma_{ij}\}$, can be computed as follows:

$$\sigma_{ij} = \min_{Z \in \mathcal{X}} \big\{ |c_{ij} - z_j| \ \text{s.t.}\ \mathrm{MDD}_i(Z) < \tau_i \big\}, \tag{4.7}$$

In (4.7), $\mathcal{X}$ is the set of all instances, $Z$ is a $D$-dimensional vector, and $\tau_i$ is a constant, typically set to a fixed fraction of $\mathrm{MDD}_i(C_i)$, where

$$\mathrm{MDD}_i(C_i) = \prod_{n=1}^{N} \big[\Pr(C_i \mid B_n)\big]^{u_{in}^m}. \tag{4.8}$$
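A literal reading of the width rule in (4.7) can be sketched as below. The sketch takes the density as an arbitrary callable so that it stays self-contained; in the actual procedure this callable would be the per-concept measure $\mathrm{MDD}_i$. The toy density, the threshold fraction of 0.5, and all names are assumptions made purely for illustration.

```python
import math


def concept_widths(center, instances, density, tau):
    """Per-dimension width: the smallest |c_ij - z_j| over the instances Z
    whose density value falls below the threshold tau."""
    low = [z for z in instances if density(z) < tau]
    if not low:
        raise ValueError("no instance falls below the density threshold")
    return [min(abs(cj - z[j]) for z in low) for j, cj in enumerate(center)]


def toy_density(z, center=(1.0, 1.0)):
    """Stand-in for MDD_i(Z): decays with squared distance from the concept center."""
    return math.exp(-sum((zj - cj) ** 2 for zj, cj in zip(z, center)))


center = (1.0, 1.0)
instances = [(1.0, 1.0), (1.5, 1.0), (3.0, 2.5), (-0.5, 4.0)]
tau = 0.5 * toy_density(center)  # assumed fraction of the density at the center
widths = concept_widths(center, instances, toy_density, tau)
```

In this toy setting only the last two instances fall below the threshold, so each width is the distance, along that dimension, to the nearest of those two instances.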

To identify the rules' consequent parts, we can employ one of the following two strategies:

1.  The consequent parts of the multiple instance fuzzy rules are set to the singleton fuzzy set {1}. Using this strategy, positive bags that activate a rule lead to a rule output of 1. This is in line with the standard MIL assumption, given that rules describe positive concepts.

2.  Treat concepts as regular contexts. For each multiple instance fuzzy rule, the parameters of its consequent fuzzy sets are identified by considering the ratio of positive to negative instances within the context described by the multiple instance fuzzy rule. For example, if 90% of the instances in a context come from positive bags, then the consequent MF can be set to a Gaussian with center equal to 0.9 and a predefined standard deviation $\epsilon$.

4.3.1  Illustrative  Example 

To illustrate the proposed multiple instance fuzzy rules and the ability to learn from data without instance-level labels, we use a simple synthetic dataset. The data were generated from a distribution with two positive concepts, marked with squares in Figure 4.3. From each positive concept we generated 50 bags. We also generated 50 negative bags randomly from non-concept regions. The number of instances within each bag is a random number between 2 and 10. The data are shown in Figure 4.3. Instances from negative bags are shown as “.”, and instances from positive bags are shown as “+” or “A”, depending on the underlying concept. In Figure 4.3, we highlight one bag from Concept 1 by circling all of its instances. As can be seen, one instance is close to a dense region of a positive concept while the other instances are scattered around. We note here that the centers of the concepts in Figure 4.3 are shown only for the purpose of explanation and validation. We do not use this information, as it is not available.

Figure 4.3: Instances from positive and negative bags drawn from data that have 2 concepts. Instances from negative bags are shown as “.”, and instances from positive bags are shown as “+” or “A”. Instances from one sample positive bag are circled.

First, we run FCMI [59] to identify target concepts. These points are then used to identify the parameters of the fuzzy rules. Next, for the identification of the rules' consequents, we set the output MFs to the singleton fuzzy set {1}. This ensures that bags that have instances within the positive concepts are assigned a high output. Finally, all the rules' parameters are used within an MI-Mamdani fuzzy inference system composed of two multiple instance fuzzy rules, each with two inputs and one output. A graphical representation of this system is shown in Figure 4.4. It can be seen that the centers of the MFs identified using FCMI match the centers of the positive concepts shown in Figure 4.3. To test the system, we generate 3 bags of instances: 2 positive bags and 1 negative bag. The instances of these bags are displayed in Figure 4.5. The multiple instance fuzzy inference using the MI-Mamdani system on the 3 test bags is summarized in Figure 4.6. The inference starts with the fuzzification of all the instances at the same time, as illustrated in Figure 4.6a; then the multiple instance fuzzy implication process is executed, resulting in the activation of the rules' outputs with different degrees (each degree of activation is a firing strength as defined in Section 3.3). Next, using a simple max operator, the 2 output fuzzy sets are aggregated to generate one output fuzzy set. Finally, the output set is defuzzified using its centroid weighted by the maximum rule firing strength. The weighting ensures that negative bags that do not activate any rule will always have a low output.
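The effect of the weighting can be seen in a small sketch. It assumes two rules with singleton consequents {1}, represented on a one-point output grid, and hypothetical firing strengths; the function name `weighted_centroid` is also an assumption.

```python
def weighted_centroid(firing_strengths, output_mfs, z_grid):
    """Centroid of the max-aggregated, clipped output sets, scaled by the
    maximum rule firing strength."""
    aggregated = [
        max(min(w, mf[n]) for w, mf in zip(firing_strengths, output_mfs))
        for n in range(len(z_grid))
    ]
    total = sum(aggregated)
    if total == 0.0:
        return 0.0
    centroid = sum(z * m for z, m in zip(z_grid, aggregated)) / total
    return max(firing_strengths) * centroid


# Two rules whose consequents are the singleton fuzzy set {1}.
z_grid = [1.0]
singletons = [[1.0], [1.0]]
positive_bag_output = weighted_centroid([0.9, 0.1], singletons, z_grid)
negative_bag_output = weighted_centroid([0.01, 0.02], singletons, z_grid)
```

With singleton consequents the plain centroid equals 1 whenever any rule fires at all, however weakly, so an unweighted centroid could not separate the two bags; scaling by the maximum firing strength is what keeps the output of the weakly activating bag low.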

In addition, we notice that while both the first and second bags are positive, the inference process assigned a lower degree of belief to the second bag and, as a consequence, a lower output value. This will not impact the classification results, as negative bags will not be able to activate any of the rules to a significant degree. Rather, it gives applications an assessment of the confidence of the prediction.

Figure 4.4: Illustration of MI-Mamdani fuzzy inference system learned using FCMI

Figure 4.5: Instances from 2 positive and 1 negative bag (test bags; legend: positive bag 1, positive bag 2, negative bag 1).
Figure 4.6: Multiple instance fuzzy inference using the learned MI-Mamdani system: (a) inference process with the first positive bag; (b) inference process with the second positive bag; (c) inference process with the negative bag. The level of the activation indicates the membership degree of a bag in a given concept (i.e., rule). The system's defuzzified output is the final confidence value.
4.4  Chapter Summary


In this chapter, we used our multiple instance fuzzy logic framework to: 1) derive a multiple instance Mamdani fuzzy inference style, and 2) derive a multiple instance Sugeno fuzzy inference style. We have also presented a method to learn multiple instance rules from data to solve MIL problems. The FCMI algorithm is used to extract concept points in the instance space, which are then transformed into multiple instance rules. This approach is essentially based on intuition. Although the premise and consequent parameters of the MI-Mamdani and MI-Sugeno systems can be learned from data, the processes of identifying the two sets of parameters are independent. In the next chapter, we introduce a neuro-fuzzy architecture that is capable of learning from ambiguously labeled data without having to use FCMI to identify multiple instance rules, and that can jointly learn the optimal premise and consequent parameters using the backpropagation algorithm.
CHAPTER 5


MI-ANFIS: A MULTIPLE INSTANCE ADAPTIVE NEURO-FUZZY ARCHITECTURE


In this chapter, we introduce an adaptive neuro-fuzzy architecture based on the framework of multiple instance fuzzy logic that is designed to handle reasoning with bags of instances as input and is capable of learning from ambiguously labeled data. The new architecture, called Multiple Instance-ANFIS (MI-ANFIS), is an extension of the standard Adaptive Neuro-Fuzzy Inference System (ANFIS) [50].

In the following, we describe the architecture of the proposed MI-ANFIS and introduce a corresponding learning algorithm.


5.1  MI-ANFIS  Architecture 


Figure 5.1: Architecture of the proposed multiple instance Adaptive Neuro-Fuzzy Inference System


Let $B_p$ be a bag of $M_p$ instances as defined in (3.2). For simplicity, we introduce our MI-ANFIS for the case of two rules. The generalization to an arbitrary number of rules is trivial. The MI-ANFIS with two Sugeno rules can be described as:

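$$\begin{aligned}
\mathcal{R}^1(B_p) &: \bigvee_{j=1}^{M_p} \big(\text{If } x_{pj1} \text{ is } A_{11}, \dots, \text{ and } x_{pjD} \text{ is } A_{1D}\big), \text{ then } f_1 = C(x_{p1} \cdot b^1, \dots, x_{pM_p} \cdot b^1) \\
\mathcal{R}^2(B_p) &: \bigvee_{j=1}^{M_p} \big(\text{If } x_{pj1} \text{ is } A_{21}, \dots, \text{ and } x_{pjD} \text{ is } A_{2D}\big), \text{ then } f_2 = C(x_{p1} \cdot b^2, \dots, x_{pM_p} \cdot b^2)
\end{aligned} \tag{5.1}$$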

Figure 5.1 illustrates the proposed MI-ANFIS architecture. As in the traditional ANFIS, nodes at the same layer have similar functions. We denote the output of the $i$th node in layer $l$ as $O_{l,i}$.

Layer 1 is an adaptive layer. It calculates the degree to which a given input instance satisfies a quantifier $A$. Every node evaluates the membership degree of an input instance in the fuzzy set $A_{kj}$ with membership function $\mu_{A_{kj}}$. Generally, $\mu_{A_{kj}}$ is a parameterized membership function (MF), for example a Gaussian MF with

$$\mu_{A_{kj}}(x) = \exp\Big(-\frac{(x - c_{kj})^2}{2\sigma_{kj}^2}\Big). \tag{5.2}$$

In (5.2), $c_{kj}$ and $\sigma_{kj}$ are the mean and standard deviation of the Gaussian function, and are referred to as the premise parameters.

Layer 2 is a fixed layer where every node computes the product of all incoming inputs. It evaluates the degree of truth of proposition instances, or simply, “truth instances”. The output of this layer is

$$O_{2,i} = \mu_{\lceil i/M_p \rceil,\, i[M_p]} = \prod_{j=1}^{D} \mu_{A_{\lceil i/M_p \rceil j}}\big(x_{p\, i[M_p]\, j}\big), \tag{5.3}$$

where $\lceil \cdot \rceil$ is the ceiling operator, and $i[M_p]$ is $i \bmod M_p$. As in the traditional ANFIS, any T-norm can be used as the node function in this layer.


Layer 3 is a new addition when compared to the traditional ANFIS. Every node aggregates
the truth instances of the previous layer by means of a smooth T-conorm. We use
a smooth approximation of the “max” T-conorm known as the “softmax” function
($S_\alpha$):

$$\mathrm{softmax}_\alpha(x_1, x_2, \ldots, x_n) = S_\alpha(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} \frac{x_i\, e^{\alpha x_i}}{\sum_{j=1}^{n} e^{\alpha x_j}}. \tag{5.4}$$

In (5.4), $\alpha$ determines how closely softmax approximates the max operator. As $\alpha$
approaches $\infty$, softmax’s behavior approaches max. When $\alpha = 0$, it calculates the
mean. As $\alpha$ approaches $-\infty$, softmax’s behavior approaches the min operator. The
outputs of this layer are the firing strengths of the multiple instance fuzzy rules defined
by Layer 1 through Layer 3, i.e.,

$$O_{3,i} = w_i = S_\alpha\big(\{r_{i,j}\}_{j=1}^{M_p}\big). \tag{5.5}$$

Layer 3 is also a fixed layer.
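The limiting behavior of the softmax aggregator in (5.4) can be checked numerically. The following is a minimal Python sketch, not the implementation used in this work; the function name `smooth_max` and the sample values are illustrative assumptions, and the exponent shift is a standard numerical-stability device that does not change the value of (5.4).

```python
import math

def smooth_max(values, alpha):
    """S_alpha from (5.4): a weighted mean with weights exp(alpha * x_i)."""
    # Shift the exponents by their maximum for numerical stability; the
    # shift cancels between numerator and denominator.
    shift = max(alpha * v for v in values)
    weights = [math.exp(alpha * v - shift) for v in values]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

truth = [0.1, 0.4, 0.9]              # hypothetical truth instances of one rule
print(smooth_max(truth, 0.0))        # alpha = 0: arithmetic mean
print(smooth_max(truth, 50.0))       # large positive alpha: close to max(truth)
print(smooth_max(truth, -50.0))      # large negative alpha: close to min(truth)
```

Intermediate values of $\alpha$ let every truth instance contribute to the firing strength, which is what keeps the aggregation differentiable with respect to all instances of the bag.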


Layer 4 is a fixed layer. Every node in this layer calculates the normalized firing
strength of each rule, i.e.,

$$O_{4,i} = \bar{w}_i = \frac{w_i}{\sum_{k=1}^{|O_3|} w_k}, \tag{5.6}$$

where $|O_3|$ is the number of rules.


Layer 5 is an adaptive layer. Every node $i$ in this layer computes the output of the $i$th
multiple instance rule using

$$O_{5,i} = \bar{w}_i\, C(\mathbf{x}_{p1} \cdot \mathbf{b}^i, \mathbf{x}_{p2} \cdot \mathbf{b}^i, \ldots, \mathbf{x}_{pM_p} \cdot \mathbf{b}^i). \tag{5.7}$$

The parameters $\{\mathbf{b}^i\}_{i=1}^{|O_3|}$ are referred to as the consequent parameters. The only
constraint on $C$ is that it has to be a smooth function to allow for optimization
techniques to be applied. We use the “softmax” as the combining function for this
layer. In this case, (5.7) is equivalent to:

$$O_{5,i} = \bar{w}_i\, S_\alpha(\mathbf{x}_{p1} \cdot \mathbf{b}^i, \mathbf{x}_{p2} \cdot \mathbf{b}^i, \ldots, \mathbf{x}_{pM_p} \cdot \mathbf{b}^i). \tag{5.8}$$

Note that the constant $\alpha$ here is not necessarily the same as in Layer 3.

Optionally, an activation function (such as Tanh or Sigmoid) could be applied to the
individual linear responses (i.e., $\mathbf{x}_{pj} \cdot \mathbf{b}^i$). This has the advantage of protecting against
the exploding gradient problem when using the backpropagation algorithm [92].

Layer 6 is a fixed layer with a single node labeled $\Sigma$. As in the traditional ANFIS, it
computes the overall output of the system using

$$O_{6,1} = \sum_{i=1}^{|O_3|} O_{5,i} = \sum_{i=1}^{|O_3|} \bar{w}_i\, S_\alpha(\mathbf{x}_{p1} \cdot \mathbf{b}^i, \mathbf{x}_{p2} \cdot \mathbf{b}^i, \ldots, \mathbf{x}_{pM_p} \cdot \mathbf{b}^i). \tag{5.9}$$

Algorithm 5.1 highlights the MI-ANFIS forward pass.
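The six layers can be sketched end to end for a single bag. The Python code below is a minimal illustration under stated assumptions, not the implementation used in this work: the names `forward`, `gaussian_mf`, and `smooth_max` are hypothetical, the product T-norm is used in Layer 2, the consequent vectors have no bias term, and the same $\alpha$ is used in Layers 3 and 5 although the text allows them to differ.

```python
import math

def gaussian_mf(x, c, sigma):
    """Layer 1, equation (5.2)."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

def smooth_max(values, alpha):
    """Softmax aggregator S_alpha of equation (5.4)."""
    shift = max(alpha * v for v in values)
    weights = [math.exp(alpha * v - shift) for v in values]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def forward(bag, centers, sigmas, consequents, alpha=5.0):
    """Forward pass for one bag.

    bag:         list of M_p instances, each a list of D floats
    centers:     R x D premise centers c_kj
    sigmas:      R x D premise widths sigma_kj
    consequents: R x D consequent vectors b^k (no bias term assumed)
    """
    firing = []
    for c_k, s_k in zip(centers, sigmas):
        # Layers 1-2: one truth instance per (rule, instance) pair.
        truth = [math.prod(gaussian_mf(x, c, s)
                           for x, c, s in zip(inst, c_k, s_k))
                 for inst in bag]
        # Layer 3: aggregate the truth instances into a firing strength.
        firing.append(smooth_max(truth, alpha))
    total = sum(firing)  # assumed non-zero, i.e. at least one rule fires
    output = 0.0
    for w_k, b_k in zip(firing, consequents):
        # Layer 5: smooth max of the per-instance linear responses.
        responses = [sum(x * b for x, b in zip(inst, b_k)) for inst in bag]
        # Layers 4 and 6: normalize the firing strength and accumulate.
        output += (w_k / total) * smooth_max(responses, alpha)
    return output

bag = [[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]]        # M_p = 3, D = 2
centers = [[0.2, 0.8], [0.9, 0.9]]                # R = 2
sigmas = [[0.5, 0.5], [0.5, 0.5]]
consequents = [[1.0, 0.0], [0.0, 1.0]]
print(forward(bag, centers, sigmas, consequents))
```

Because the normalized firing strengths sum to one, the network output is a convex combination of the per-rule Layer 5 responses.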

Algorithm 5.1 MI-ANFIS Forward Pass Algorithm

Inputs:  $B$: the set of training bags.
         $T$: the set of training labels.
         $M$: the number of instances in each bag.
         $R$: the number of rules.
         $D$: the dimensionality of instances.
         $\alpha$: the constant used in the “softmax” function.
Outputs: $O_{6,1}$: the output of the network.

for each instance do
    for j from 1 to D do
        for k from 1 to R do
            Fuzzification of inputs using (5.2).
        end for
    end for
end for
for each instance do
    for each rule do
        Evaluation of truth instances using (5.3).
    end for
end for
for each rule do
    Compute activation degree using (5.5).
    Compute normalized activation degree using (5.6).
end for
for each instance do
    for j from 1 to D do
        for k from 1 to R do
            Evaluate the output $O_{5,k}$ using (5.7).
        end for
    end for
end for
Evaluate the overall output $O_{6,1}$ using (5.9).
return $O_{6,1}$

5.2 Basic Learning Algorithm

To identify the parameters of the proposed MI-ANFIS network, we propose a variation
of the basic learning algorithm presented by Jang [50] (also outlined in Chapter 2 of this
dissertation). Our variation differs from the standard ANFIS backpropagation learning
rule because of the additional layers and the use of the “softmax” function. Thus, all update
equations need to be rederived.


5.2.1  BackPropagation  Learning  Rule 


In the following, we assume that we have $N$ training bags, $B = \{B_p \mid p = 1, \ldots, N\}$,
and their corresponding labels $T = \{t_p \mid p = 1, \ldots, N\}$. After presenting the $p$th training
bag, we compute its squared error measure commonly used in the backpropagation algorithm
and defined as

$$E_p = (t_p - O_p)^2. \tag{5.10}$$

In (5.10), $t_p$ is the desired bag output, and $O_p$ is the computed output of the network when
presented with training bag $B_p$. Equation (5.10) demonstrates the need for MI-ANFIS. In
fact, due to the absence of instances’ labels, errors can be computed only at the bag level.
Errors at the instance level cannot be computed and are not needed, as we will show later.
The overall error measure of the network after presenting all $N$ bags is

$$E = \sum_{p=1}^{N} E_p. \tag{5.11}$$


To develop the gradient descent optimization on $E$, we compute the error rate for
the $p$th training bag and for each node output $O_{l,i}$. This error rate $\varepsilon_{l,i}$ ($1 \le l \le 6$ indicates
an MI-ANFIS layer) is defined as

$$\varepsilon_{l,i} = \frac{\partial E_p}{\partial O_{l,i}}. \tag{5.12}$$

The error rate at the output node is given as follows:

$$\frac{\partial E_p}{\partial O_{6,1}} = \frac{\partial E_p}{\partial O_p} = -2(t_p - O_p). \tag{5.13}$$

For non-output nodes (i.e., internal nodes, $l < 6$), we derive the error rate using the chain
rule:

$$\varepsilon_{l,i} = \sum_{h=1}^{\mathrm{Card}(l+1)} \frac{\partial E_p}{\partial O_{l+1,h}} \frac{\partial O_{l+1,h}}{\partial O_{l,i}}, \tag{5.14}$$

where $\mathrm{Card}(l+1)$ refers to the number of nodes at layer $l+1$.


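Next, we seek to minimize the network error with respect to the premise parameters
$\{c_{kj}, \sigma_{kj} \mid 1 \le k \le |O_3|,\ 1 \le j \le D\}$, and with respect to the consequent parameters
$\{\mathbf{b}^i\}_{i=1}^{|O_3|}$.

The error rate with respect to a generic parameter $\theta$ can be computed using

$$\frac{\partial E_p}{\partial \theta} = \sum_{O^* \in G} \frac{\partial E_p}{\partial O^*} \frac{\partial O^*}{\partial \theta}, \tag{5.15}$$

where $G$ is the set of nodes whose outputs depend on $\theta$.

Using (5.11), the total error rate is given by

$$\frac{\partial E}{\partial \theta} = \sum_{p=1}^{N} \frac{\partial E_p}{\partial \theta}. \tag{5.16}$$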


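5.2.1.1 Update Rule For Premise Parameters

First we compute the error rate for the premise parameters $c_{kj}$ and $\sigma_{kj}$ using

$$\frac{\partial E_p}{\partial c_{kj}} = \sum_{i=1}^{M_p} \frac{\partial E_p}{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}} \frac{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}}{\partial c_{kj}}, \tag{5.17}$$

$$\frac{\partial E_p}{\partial \sigma_{kj}} = \sum_{i=1}^{M_p} \frac{\partial E_p}{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}} \frac{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}}{\partial \sigma_{kj}}. \tag{5.18}$$

Using the chain rule defined in (5.14), we have

$$\frac{\partial E_p}{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}} = \frac{\partial E_p}{\partial O_{6,1}} \times \frac{\partial O_{6,1}}{\partial O_{5,k}} \times \frac{\partial O_{5,k}}{\partial O_{4,k}} \times \frac{\partial O_{4,k}}{\partial O_{3,k}} \times \frac{\partial O_{3,k}}{\partial O_{(2,\,i+(k-1)M_p)}} \times \frac{\partial O_{(2,\,i+(k-1)M_p)}}{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}}. \tag{5.19}$$

Hence, (5.17) is equivalent to

$$\frac{\partial E_p}{\partial c_{kj}} = \frac{\partial E_p}{\partial O_{6,1}} \times \frac{\partial O_{6,1}}{\partial O_{5,k}} \times \frac{\partial O_{5,k}}{\partial O_{4,k}} \times \frac{\partial O_{4,k}}{\partial O_{3,k}} \times \sum_{i=1}^{M_p} \frac{\partial O_{3,k}}{\partial O_{(2,\,i+(k-1)M_p)}} \frac{\partial O_{(2,\,i+(k-1)M_p)}}{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}} \frac{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}}{\partial c_{kj}}. \tag{5.20}$$

From (5.13), we have

$$\frac{\partial E_p}{\partial O_{6,1}} = -2(t_p - O_p). \tag{5.21}$$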


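It is also straightforward to show that

$$\frac{\partial O_{6,1}}{\partial O_{5,k}} = \frac{\partial \big(\sum_{i=1}^{|O_3|} O_{5,i}\big)}{\partial O_{5,k}} = 1, \tag{5.22}$$

$$\frac{\partial O_{5,k}}{\partial O_{4,k}} = \frac{\partial \big(\bar{w}_k\, S_\alpha(\mathbf{x}_{p1} \cdot \mathbf{b}^k, \mathbf{x}_{p2} \cdot \mathbf{b}^k, \ldots, \mathbf{x}_{pM_p} \cdot \mathbf{b}^k)\big)}{\partial \bar{w}_k} = S_\alpha(\mathbf{x}_{p1} \cdot \mathbf{b}^k, \mathbf{x}_{p2} \cdot \mathbf{b}^k, \ldots, \mathbf{x}_{pM_p} \cdot \mathbf{b}^k). \tag{5.23}$$

Continuing with the derivation,

$$\frac{\partial O_{4,k}}{\partial O_{3,k}} = \frac{\partial \bar{w}_k}{\partial w_k} = \frac{\partial}{\partial w_k}\left(\frac{w_k}{\sum_{l=1}^{|O_3|} w_l}\right) = \frac{\sum_{l=1}^{|O_3|} w_l - w_k}{\big(\sum_{l=1}^{|O_3|} w_l\big)^2}, \tag{5.24}$$

$$\frac{\partial O_{3,k}}{\partial O_{(2,\,i+(k-1)M_p)}} = \frac{\partial S_\alpha\big(\{r_{k,m}\}_{m=1}^{M_p}\big)}{\partial r_{k,(i+(k-1)M_p)}} = \frac{e^{\alpha r_{k,(i+(k-1)M_p)}}}{\sum_{m=1}^{M_p} e^{\alpha r_{k,m}}} \Big[1 + \alpha\big(r_{k,(i+(k-1)M_p)} - S_\alpha(\{r_{k,m}\}_{m=1}^{M_p})\big)\Big]. \tag{5.25}$$

The details of the derivation of the “softmax” function can be found in [39].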
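The closed form in (5.25) can be sanity-checked against central finite differences. The following Python sketch is illustrative only; the helper names and the sample values are assumptions, and a single-index vector stands in for the truth instances $r_{k,m}$ of one rule.

```python
import math

def smooth_max(values, alpha):
    """S_alpha of equation (5.4)."""
    shift = max(alpha * v for v in values)
    weights = [math.exp(alpha * v - shift) for v in values]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def smooth_max_grad(values, alpha):
    """Analytic partial derivatives of S_alpha, following (5.25)."""
    shift = max(alpha * v for v in values)
    weights = [math.exp(alpha * v - shift) for v in values]
    z = sum(weights)
    s = sum(v * w for v, w in zip(values, weights)) / z
    return [(w / z) * (1.0 + alpha * (v - s)) for v, w in zip(values, weights)]

def finite_difference(values, alpha, i, eps=1e-6):
    """Central difference approximation of dS_alpha / dx_i."""
    up = list(values)
    up[i] += eps
    down = list(values)
    down[i] -= eps
    return (smooth_max(up, alpha) - smooth_max(down, alpha)) / (2.0 * eps)

r = [0.2, 0.7, 0.4]                  # hypothetical truth instances of one rule
analytic = smooth_max_grad(r, 3.0)
numeric = [finite_difference(r, 3.0, i) for i in range(len(r))]
print(analytic)
print(numeric)
```

A useful side check is that the partial derivatives sum to one for any $\alpha$, which follows directly from (5.25).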
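Next, we need to compute $\partial O_{(2,\,i+(k-1)M_p)} / \partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}$. We have

$$O_{(2,\,i+(k-1)M_p)} = \prod_{d=1}^{D} \mu_{A_{\lceil (i+(k-1)M_p)/M_p \rceil, d}}\big(x_{p,(i+(k-1)M_p)[M_p],d}\big), \tag{5.26}$$

$$O_{(1,\,i+[(k-1)D+(j-1)]M_p)} = \mu_{A_{kj}}\big(x_{p,(i+(k-1)M_p)[M_p],j}\big). \tag{5.27}$$

Thus,

$$\frac{\partial O_{(2,\,i+(k-1)M_p)}}{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}} = \frac{\partial \Big(\prod_{d=1}^{D} \mu_{A_{\lceil (i+(k-1)M_p)/M_p \rceil, d}}\big(x_{p,(i+(k-1)M_p)[M_p],d}\big)\Big)}{\partial \Big(\mu_{A_{kj}}\big(x_{p,(i+(k-1)M_p)[M_p],j}\big)\Big)} = \prod_{d=1,\, d \neq j}^{D} \mu_{A_{\lceil (i+(k-1)M_p)/M_p \rceil, d}}\big(x_{p,(i+(k-1)M_p)[M_p],d}\big). \tag{5.28}$$

Finally, we have

$$\frac{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}}{\partial c_{kj}} = \frac{\partial \mu_{A_{kj}}\big(x_{p,(i+(k-1)M_p)[M_p],j}\big)}{\partial c_{kj}} = \frac{x_{p,(i+(k-1)M_p)[M_p],j} - c_{kj}}{\sigma_{kj}^2} \times \exp\left(-\frac{\big(x_{p,(i+(k-1)M_p)[M_p],j} - c_{kj}\big)^2}{2\sigma_{kj}^2}\right). \tag{5.29}$$

Substituting the derivatives in (5.20), we obtain

$$
\begin{aligned}
\frac{\partial E_p}{\partial c_{kj}} = {} & -2(t_p - O_p) \times S_\alpha(\mathbf{x}_{p1} \cdot \mathbf{b}^k, \mathbf{x}_{p2} \cdot \mathbf{b}^k, \ldots, \mathbf{x}_{pM_p} \cdot \mathbf{b}^k) \times \frac{\sum_{l=1}^{|O_3|} w_l - w_k}{\big(\sum_{l=1}^{|O_3|} w_l\big)^2} \\
& \times \sum_{i=1}^{M_p} \Bigg( \frac{e^{\alpha r_{k,(i+(k-1)M_p)}}}{\sum_{m=1}^{M_p} e^{\alpha r_{k,m}}} \Big[1 + \alpha\big(r_{k,(i+(k-1)M_p)} - S_\alpha(\{r_{k,m}\}_{m=1}^{M_p})\big)\Big] \\
& \qquad \times \prod_{d=1,\, d \neq j}^{D} \mu_{A_{\lceil (i+(k-1)M_p)/M_p \rceil, d}}\big(x_{p,(i+(k-1)M_p)[M_p],d}\big) \\
& \qquad \times \frac{x_{p,(i+(k-1)M_p)[M_p],j} - c_{kj}}{\sigma_{kj}^2} \times \exp\left(-\frac{\big(x_{p,(i+(k-1)M_p)[M_p],j} - c_{kj}\big)^2}{2\sigma_{kj}^2}\right) \Bigg).
\end{aligned}
\tag{5.30}
$$

As in the standard ANFIS, an update formula for the parameter $c_{kj}$ is given by

$$\Delta c_{kj} = -\eta \frac{\partial E}{\partial c_{kj}}, \tag{5.31}$$

where $\eta$ is a learning rate determined in a similar manner to that of the standard
backpropagation algorithm [50].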
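The membership-function derivative in (5.29) and the update form (5.31) can be illustrated on a single Gaussian MF. This Python sketch is not the training code of this work: the function names are hypothetical, the analogous derivative with respect to the width $\sigma_{kj}$ is included for comparison, and the toy error $(1 - \mu)^2$ is an assumption chosen only to show one gradient step.

```python
import math

def gaussian_mf(x, c, sigma):
    """Gaussian MF of equation (5.2)."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

def dmf_dc(x, c, sigma):
    """Derivative of the MF with respect to its center, as in (5.29)."""
    return (x - c) / sigma ** 2 * gaussian_mf(x, c, sigma)

def dmf_dsigma(x, c, sigma):
    """Analogous derivative of the MF with respect to its width."""
    return (x - c) ** 2 / sigma ** 3 * gaussian_mf(x, c, sigma)

x, c, sigma, eps = 0.8, 0.5, 0.4, 1e-6

# Central finite differences as an independent check of both derivatives.
num_dc = (gaussian_mf(x, c + eps, sigma) - gaussian_mf(x, c - eps, sigma)) / (2 * eps)
num_ds = (gaussian_mf(x, c, sigma + eps) - gaussian_mf(x, c, sigma - eps)) / (2 * eps)
print(abs(num_dc - dmf_dc(x, c, sigma)) < 1e-6)
print(abs(num_ds - dmf_dsigma(x, c, sigma)) < 1e-6)

# One update of the form (5.31) on a toy error E = (1 - mu)^2.
eta = 0.1
mu = gaussian_mf(x, c, sigma)
dE_dc = -2.0 * (1.0 - mu) * dmf_dc(x, c, sigma)
c_new = c - eta * dE_dc
print((1.0 - gaussian_mf(x, c_new, sigma)) ** 2 < (1.0 - mu) ** 2)
```

In the full network the factor multiplying the MF derivative is the product of upstream terms in (5.30) rather than the toy error used here.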

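The update formula for $\sigma_{kj}$ can be derived in a similar manner. It can be shown
that

$$\frac{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}}{\partial \sigma_{kj}} = \frac{\partial \mu_{A_{kj}}\big(x_{p,(i+(k-1)M_p)[M_p],j}\big)}{\partial \sigma_{kj}} = \frac{\big(x_{p,(i+(k-1)M_p)[M_p],j} - c_{kj}\big)^2}{\sigma_{kj}^3} \times \exp\left(-\frac{\big(x_{p,(i+(k-1)M_p)[M_p],j} - c_{kj}\big)^2}{2\sigma_{kj}^2}\right). \tag{5.32}$$

Using (5.18), we obtain

$$
\begin{aligned}
\frac{\partial E_p}{\partial \sigma_{kj}} = {} & -2(t_p - O_p) \times S_\alpha(\mathbf{x}_{p1} \cdot \mathbf{b}^k, \mathbf{x}_{p2} \cdot \mathbf{b}^k, \ldots, \mathbf{x}_{pM_p} \cdot \mathbf{b}^k) \times \frac{\sum_{l=1}^{|O_3|} w_l - w_k}{\big(\sum_{l=1}^{|O_3|} w_l\big)^2} \\
& \times \sum_{i=1}^{M_p} \Bigg( \frac{e^{\alpha r_{k,(i+(k-1)M_p)}}}{\sum_{m=1}^{M_p} e^{\alpha r_{k,m}}} \Big[1 + \alpha\big(r_{k,(i+(k-1)M_p)} - S_\alpha(\{r_{k,m}\}_{m=1}^{M_p})\big)\Big] \\
& \qquad \times \prod_{d=1,\, d \neq j}^{D} \mu_{A_{\lceil (i+(k-1)M_p)/M_p \rceil, d}}\big(x_{p,(i+(k-1)M_p)[M_p],d}\big) \\
& \qquad \times \frac{\big(x_{p,(i+(k-1)M_p)[M_p],j} - c_{kj}\big)^2}{\sigma_{kj}^3} \times \exp\left(-\frac{\big(x_{p,(i+(k-1)M_p)[M_p],j} - c_{kj}\big)^2}{2\sigma_{kj}^2}\right) \Bigg).
\end{aligned}
\tag{5.33}
$$

The update formula for $\sigma_{kj}$ is as follows:

$$\Delta \sigma_{kj} = -\eta \frac{\partial E}{\partial \sigma_{kj}}, \tag{5.34}$$

where $\eta$ is the same learning rate as in (5.31).

5.2.1.2 Update Rule For Consequent Parameters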


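The error rate for the consequent parameters $\{\mathbf{b}^i = \{b^i_1, \ldots, b^i_D\},\ i = 1 \ldots |O_3|\}$ is
defined as

$$\frac{\partial E_p}{\partial \mathbf{b}^i} = \left(\frac{\partial E_p}{\partial b^i_1}, \ldots, \frac{\partial E_p}{\partial b^i_j}, \ldots, \frac{\partial E_p}{\partial b^i_D}\right), \tag{5.35}$$

where,

$$\frac{\partial E_p}{\partial b^i_j} = \frac{\partial E_p}{\partial O_{(5,i)}} \frac{\partial O_{(5,i)}}{\partial b^i_j}, \quad \text{for } j = 1, \ldots, D. \tag{5.36}$$

Next, we compute $\partial E_p / \partial O_{(5,i)}$ using the previously defined chain rule in (5.14), and obtain

$$\frac{\partial E_p}{\partial O_{(5,i)}} = \frac{\partial E_p}{\partial O_{6,1}} \times \frac{\partial O_{6,1}}{\partial O_{5,i}}. \tag{5.37}$$

From (5.13), we have

$$\frac{\partial E_p}{\partial O_{6,1}} = -2(t_p - O_p). \tag{5.38}$$

And from (5.9), we have

$$\frac{\partial O_{6,1}}{\partial O_{5,i}} = 1. \tag{5.39}$$

Continuing with the derivation, we have

$$
\begin{aligned}
\frac{\partial O_{(5,i)}}{\partial b^i_j} &= \frac{\partial \big(\bar{w}_i\, S_\alpha(\mathbf{x}_{p1} \cdot \mathbf{b}^i, \mathbf{x}_{p2} \cdot \mathbf{b}^i, \ldots, \mathbf{x}_{pM_p} \cdot \mathbf{b}^i)\big)}{\partial b^i_j} \\
&= \bar{w}_i \frac{\partial}{\partial b^i_j} \sum_{m=1}^{M_p} \frac{\mathbf{x}_{pm} \cdot \mathbf{b}^i \exp(\alpha\, \mathbf{x}_{pm} \cdot \mathbf{b}^i)}{\sum_{h=1}^{M_p} \exp(\alpha\, \mathbf{x}_{ph} \cdot \mathbf{b}^i)} \\
&= \bar{w}_i \sum_{m=1}^{M_p} \frac{\partial}{\partial b^i_j} \left( \frac{\mathbf{x}_{pm} \cdot \mathbf{b}^i}{\sum_{h=1}^{M_p} \exp\big(\alpha(\mathbf{x}_{ph} \cdot \mathbf{b}^i - \mathbf{x}_{pm} \cdot \mathbf{b}^i)\big)} \right) \\
&= \bar{w}_i \sum_{m=1}^{M_p} \left( \frac{x_{pmj}}{\sum_{h=1}^{M_p} \exp\big(\alpha(\mathbf{x}_{ph} \cdot \mathbf{b}^i - \mathbf{x}_{pm} \cdot \mathbf{b}^i)\big)} - \frac{\mathbf{x}_{pm} \cdot \mathbf{b}^i \sum_{h=1}^{M_p} \exp\big(\alpha(\mathbf{x}_{ph} \cdot \mathbf{b}^i - \mathbf{x}_{pm} \cdot \mathbf{b}^i)\big)\, \alpha\, (x_{phj} - x_{pmj})}{\Big(\sum_{h=1}^{M_p} \exp\big(\alpha(\mathbf{x}_{ph} \cdot \mathbf{b}^i - \mathbf{x}_{pm} \cdot \mathbf{b}^i)\big)\Big)^2} \right).
\end{aligned}
\tag{5.40}
$$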


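Thus, the overall error rate with respect to the consequent parameter $b^i_j$ is given according
to (5.16) as follows

$$
\begin{aligned}
\frac{\partial E}{\partial b^i_j} &= \sum_{p=1}^{N} \frac{\partial E_p}{\partial b^i_j} \\
&= \sum_{p=1}^{N} -2(t_p - O_p) \times \bar{w}_i \sum_{m=1}^{M_p} \left( \frac{x_{pmj}}{\sum_{h=1}^{M_p} \exp\big(\alpha(\mathbf{x}_{ph} \cdot \mathbf{b}^i - \mathbf{x}_{pm} \cdot \mathbf{b}^i)\big)} - \frac{\mathbf{x}_{pm} \cdot \mathbf{b}^i \sum_{h=1}^{M_p} \exp\big(\alpha(\mathbf{x}_{ph} \cdot \mathbf{b}^i - \mathbf{x}_{pm} \cdot \mathbf{b}^i)\big)\, \alpha\, (x_{phj} - x_{pmj})}{\Big(\sum_{h=1}^{M_p} \exp\big(\alpha(\mathbf{x}_{ph} \cdot \mathbf{b}^i - \mathbf{x}_{pm} \cdot \mathbf{b}^i)\big)\Big)^2} \right).
\end{aligned}
\tag{5.41}
$$

Hence, the update formula for the consequent parameter $b^i_j$ is

$$\Delta b^i_j = -\eta \frac{\partial E}{\partial b^i_j}, \tag{5.42}$$

where $\eta$ is the same learning rate as in (5.31).

Equations (5.31), (5.34) and (5.42) can be used to update the $c_{kj}$, $\sigma_{kj}$ and $b^i_j$ parameters
either on-line, bag by bag (we want to emphasize here that on-line learning is not performed
instance by instance, but rather bag by bag), or off-line in batch mode after presentation
of the entire data set.

The proposed MI-ANFIS basic learning algorithm is summarized in Algorithm 5.2.

Algorithm 5.2 MI-ANFIS Basic Learning Algorithm

Inputs:  $B$: the set of training bags.
         $T$: the set of training labels.
         $M$: the number of instances in each bag.
         $\alpha$: the constant used in the “softmax” function.
         $\eta$: the learning rate.
         $E_{max}$: number of epochs.
         $\epsilon$: minimum parameters change value.
Outputs: $\mathbf{b}^i$: the sets of consequent parameters.
         $\mathbf{c}^i$: the set of membership functions’ centers.
         $\boldsymbol{\sigma}^i$: the set of membership functions’ widths.

Initialize $\mathbf{b}^i$, $\mathbf{c}^i$, and $\boldsymbol{\sigma}^i$.
repeat
    Update $\mathbf{b}^i$ using (5.42) and $\mathbf{b}^{i(new)} = \mathbf{b}^{i(old)} + \Delta \mathbf{b}^i$.
    Update $\mathbf{c}^i$ using (5.31) and $\mathbf{c}^{i(new)} = \mathbf{c}^{i(old)} + \Delta \mathbf{c}^i$.
    Update $\boldsymbol{\sigma}^i$ using (5.34) and $\boldsymbol{\sigma}^{i(new)} = \boldsymbol{\sigma}^{i(old)} + \Delta \boldsymbol{\sigma}^i$.
until $\max(\|\mathbf{b}^{i(new)} - \mathbf{b}^{i(old)}\|, \|\mathbf{c}^{i(new)} - \mathbf{c}^{i(old)}\|, \|\boldsymbol{\sigma}^{i(new)} - \boldsymbol{\sigma}^{i(old)}\|) < \epsilon$ or number of epochs $> E_{max}$
return $\mathbf{b}^i$, $\mathbf{c}^i$, $\boldsymbol{\sigma}^i$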
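The control flow of Algorithm 5.2 (update each parameter group, then stop on a small change or when the epoch budget is spent) can be sketched independently of the MI-ANFIS gradients. In the Python sketch below, `train` and `grad_fn` are hypothetical names, `grad_fn` stands in for the gradients (5.30), (5.33), and (5.41), and all groups are updated from the same gradients in batch fashion, which is an assumption. The toy quadratic error is only there to make the sketch runnable.

```python
def train(params, grad_fn, eta=0.1, max_epochs=100, tol=1e-6):
    """Gradient descent with the stopping rule of Algorithm 5.2.

    params:  dict mapping a parameter-group name to a list of floats
    grad_fn: hypothetical callable returning dE/dparam in the same layout
    """
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        grads = grad_fn(params)
        largest_change = 0.0
        updated = {}
        for name, values in params.items():
            # Delta = -eta * dE/dparam, as in (5.31), (5.34), and (5.42).
            deltas = [-eta * g for g in grads[name]]
            updated[name] = [v + d for v, d in zip(values, deltas)]
            # Euclidean norm of the change for this parameter group.
            change = sum(d * d for d in deltas) ** 0.5
            largest_change = max(largest_change, change)
        params = updated
        if largest_change < tol:
            break
    return params, epoch

# Toy stand-in for the network error: E = sum of (value - target)^2.
targets = {"b": [1.0, -2.0], "c": [0.5], "sigma": [0.3]}

def toy_grad(params):
    return {name: [2.0 * (v - t) for v, t in zip(values, targets[name])]
            for name, values in params.items()}

start = {"b": [0.0, 0.0], "c": [0.0], "sigma": [1.0]}
fitted, epochs = train(start, toy_grad, eta=0.25, max_epochs=200, tol=1e-8)
print(fitted, epochs)
```

For on-line learning the same loop would call the gradient routine once per bag instead of once per pass over the data set.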


5.3  Illustrative  Example 

After deriving the necessary learning algorithms, we now study an example to show
the potential of using MI-ANFIS to learn concepts from ambiguously labeled data. For
this purpose, let us consider the synthetic dataset used in Section 4.3.1. For illustrative
purposes, we only update the premise parameters during the training epochs, and show
that the MI-ANFIS Basic Learning Algorithm (Algorithm 5.2) is capable of identifying
positive concepts as well as their corresponding multiple instance fuzzy rules.

To initialize the premise parameters, we use the FCM [60] algorithm to partition
the instances’ space into 4 clusters (a grid or manual partitioning could also be used).
We use the clusters’ centers as initial centers for the
Gaussian MFs, and we initialize all standard deviation parameters to a default value of 0.5.
We want to emphasize here that the FCM step is for the purpose of initialization only. It is
used to identify dense regions of the input space; these regions can contain positive and/or
negative instances. The learning phase will keep and tune only the regions that correspond to
positive concepts.

The initial fuzzy sets (MFs) of the rules’ premise parts, before training, are displayed
in Figure 5.3a. As can be seen, the initial 4 clusters simply cover the 4 quadrants of the
2D instance space (if no label information is used, as in this case, the data appear to
have a uniform distribution; refer to Figure 4.5). The updated parameters after 20 training
epochs are shown in Figure 5.3b, and the learned fuzzy sets after convergence are shown in
Figure 5.3c.


Figure 5.2: Root Mean Squared Error of 100 Epochs of MI-ANFIS training.

(a) Initial MFs before starting the training process.

(b) Input MFs during MI-ANFIS training (Epoch number 20).

(c) Learned MFs after MI-ANFIS training completed. Rules marked with red crosses are considered
vanished and are pruned. The remaining rules (MI-Rule 2 and MI-Rule 4) correctly describe the positive
concepts of the dataset.

Figure 5.3: Input MFs before, during, and after MI-ANFIS training.

The system has correctly identified the positive concepts, and at the same time
identified irrelevant rules (MI-Rule 1 and MI-Rule 3, marked with red crosses in Figure
5.3c). These rules are considered irrelevant either because some of their fuzzy sets have empty
support (and consequently the rule cannot be activated), or because they have a very narrow
fuzzy set support, which may indicate overfitting and would not be general enough to use
during testing. After training, it is recommended to detect and prune such rules (e.g., by
requiring a minimum support) to improve the MI-ANFIS testing efficiency.

5.4  Preventing  Overfitting:  Rule  Dropout 

Neural networks with a large number of parameters are susceptible to overfitting.
MI-ANFIS is no exception, particularly when using a large number of multiple instance fuzzy
rules and relatively small training datasets. In such a scenario, some rules could co-adapt to
the training data and degrade the network’s ability to generalize to unseen examples. In the
previous section we suggested pruning irrelevant rules. In this section, we present a more
general technique, known as Dropout, used to prevent overfitting and rules’ co-adaptation.
Dropout is a regularization method that was introduced by Hinton et al. [93] to
alleviate the serious problem of overfitting in deep neural networks. Over the years, many
methods have been developed to reduce overfitting, including using a validation dataset to
stop the training as soon as the performance gets worse, adding weight penalties using L1
and L2 regularization, or artificially augmenting the training dataset using label-preserving
transformations. However, as noted by Hinton [93], the best way to regularize a fixed-size
model is to average the predictions of all possible settings of the parameters, each weighted by
its posterior probability given the training data. This can be approximated by combining the
predictions of an exponential number of models. Combining several models with different
architectures has the advantage of better generalization and consequently better testing
performance. While generating an ensemble of models is trivial, training them all is
prohibitively expensive.

Generally, Dropout works by setting to 0 the output of each node in a given layer with
probability $1 - p$ ($p$ typically equals 0.5) during training. Nodes that are dropped out
do not contribute to the parameter updates. During testing, all nodes are used but the
outputs are weighted by the probability $p$. Following this strategy, every time a new training
example is presented, the network samples and trains a different architecture. In other
words, Dropout trains an ensemble of networks ($2^N$ networks, $N$ being the number of nodes)
simultaneously, leading to an important speedup in training time as compared to traditional
ensemble methods. Figure 5.4 and Figure 5.5 illustrate the Dropout model.


(a) Standard Neural Network  (b) Network after Dropout

Figure 5.4: Dropout neural network model. (a) is a standard neural network. (b) is the
same network after applying dropout. Dotted lines indicate a node that has been dropped
(source: [93]).

(a) At training time  (b) At test time

Figure 5.5: Illustration of Dropout application. (a) A node is dropped with probability $1 - p$
at training time. (b) At test time the node is always present and its outputs are weighted
by $p$ (source: [93]).


We propose to adopt Dropout to regularize MI-ANFIS networks. Typically,
overfitting occurs in MI-ANFIS networks when a set of multiple instance rules co-adapt to the
provided data early during the training process, making the remaining rules irrelevant to
learn and thus degrading the network’s generalization capability. While the Dropout
technique could be applied to MI-ANFIS as is (given the inherent neural network nature of
the architecture), care should be exercised when selecting nodes to include in the list of the
randomly dropped out nodes. MI-ANFIS nodes are different from those of standard neural
networks: when grouped into a rule, they model meaning and express linguistic terms.
Dropping a few nodes from a given rule could severely handicap the fuzzy inference process.
Hence, Dropout should be executed differently. In deep neural nets, Dropout is applied to
selected layers (vertically). For MI-ANFIS, we propose to apply Dropout on a rule by rule
basis (i.e., horizontally). Either the whole rule is included, or the whole rule is dropped.
This can be achieved by applying Dropout to Layer 5 (see Figure 5.7), i.e., setting to zero
the output of the “to be dropped out” rules. We will refer to this derived technique as
“Rule Dropout”. Using a Rule Dropout strategy to train MI-ANFIS networks is approximately
equivalent to sampling and training an ensemble of $2^R$ networks ($R$ is the number of
rules).

Let $p$ be the probability with which a rule is present. Formally, Rule Dropout is
applied to Layer 5 during training as follows:

$$h_i \sim \mathrm{Bernoulli}(p), \tag{5.43}$$

$$O_{5,i} = h_i\, \bar{w}_i\, S_\alpha(\mathbf{x}_{p1} \cdot \mathbf{b}^i, \mathbf{x}_{p2} \cdot \mathbf{b}^i, \ldots, \mathbf{x}_{pM_p} \cdot \mathbf{b}^i), \tag{5.44}$$

where $h_i$ is a Bernoulli random variable with probability $p$ of being 1. During testing,
the Layer 5 output is scaled by $p$, i.e., $O_{5,i} = p\, \bar{w}_i\, S_\alpha(\mathbf{x}_{p1} \cdot \mathbf{b}^i, \mathbf{x}_{p2} \cdot \mathbf{b}^i, \ldots, \mathbf{x}_{pM_p} \cdot \mathbf{b}^i)$.
Figure 5.6 and Figure 5.7 show an illustration of an MI-ANFIS network with 3 multiple instance
fuzzy rules and implementing Rule Dropout.
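The training-time masking of (5.43) and (5.44) and the test-time scaling by $p$ can be sketched as follows. This is a minimal Python illustration, not the implementation used in this work; the function name `layer5_with_rule_dropout`, its arguments, and the sample rule outputs are assumptions.

```python
import random

def layer5_with_rule_dropout(rule_outputs, p, training, rng=random):
    """Apply Rule Dropout to the Layer 5 outputs of one bag.

    rule_outputs: list of w_bar_i * S_alpha(...) values, one per rule
    p:            probability that a rule is kept
    Returns the (possibly masked or scaled) outputs and the mask used.
    """
    if training:
        # (5.43)-(5.44): one Bernoulli(p) draw per rule masks the whole rule.
        mask = [1 if rng.random() < p else 0 for _ in rule_outputs]
        return [h * o for h, o in zip(mask, rule_outputs)], mask
    # Test time: every rule is present and its output is scaled by p.
    return [p * o for o in rule_outputs], [1] * len(rule_outputs)

rng = random.Random(0)               # seeded only to make the sketch repeatable
outputs = [0.30, 0.15, 0.05]         # hypothetical Layer 5 outputs for 3 rules
dropped, mask = layer5_with_rule_dropout(outputs, p=0.7, training=True, rng=rng)
scaled, _ = layer5_with_rule_dropout(outputs, p=0.7, training=False)
print(mask, dropped)
print(scaled)
```

Returning the mask alongside the outputs makes it available to the backward pass, where the gradients of a dropped rule are zeroed.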

Deriving the new update equations for MI-ANFIS parameters requires taking into
consideration the added Bernoulli random variable, $h_i$. It is straightforward to show that
the new gradients with respect to premise and consequent parameters are given by
(5.45) and (5.46) and, in a similar manner, (5.47).

Figure 5.6: Rule Dropout MI-ANFIS model.

Figure 5.7: Illustration of Rule Dropout application. Dotted lines indicate a rule that has
been dropped.

As can be seen, equations (5.45), (5.46), and (5.47) are zeroed when the rule is
dropped out (i.e., $h_k = 0$ and $h_i = 0$). Thus, its premise and consequent parameters are
not updated.

In order to demonstrate the gain in generalization acquired by MI-ANFIS when
utilizing Rule Dropout, we train an MI-ANFIS architecture with and without Rule Dropout
on a multiple instance dataset sampled from COREL [94]. The task is to classify whether an
image contains an elephant or not. The dataset consists of 200 images (bags): 100 positive images
containing the target animal and 100 negative images containing other animals. Each image
is represented as a set of patches (instances) and each patch is in turn represented by 230
features describing color, texture and shape information. Before training, we applied PCA
to reduce the dimensionality of the features to 10 dimensions to speed up MI-ANFIS. Table
5.1 summarizes the dataset characteristics. Two MI-ANFIS networks composed of 15 rules
each, with one network employing Rule Dropout (with $p = 0.7$), were trained on 90% of the
data, and the remaining 10% was used for testing (the split was done randomly). Figure 5.8
shows the training and testing errors for both networks during 100 epochs. As can be
seen, without Rule Dropout, starting at epoch 20, testing performance begins to degrade
while training error continues to decrease. In other words, overfitting begins to occur. On
the other hand, when using Rule Dropout, overfitting is significantly reduced and MI-ANFIS
achieved better testing performance at the end of the training phase (0.1123 testing SSE
with Rule Dropout compared to 0.1451 testing SSE without Rule Dropout).

TABLE 5.1

Benchmark data sets

Data set    dim. (PCA)    No. Bags    Positive    Negative    No. Instances
Elephant    230 (10)      200         100         100         2 → 13

Figure 5.8: Training and testing errors for two MI-ANFIS networks with and without Rule
Dropout.

5.5  Multi-Class  MI-ANFIS 

The basic MI-ANFIS architecture illustrated in Figure 5.1 computes one single output.
It is suitable for binary classification problems, and through the use of a sum of squared
error (SSE) loss function, it can effectively be used to solve multiple instance regression
problems. The extension of MI-ANFIS to multiple class classification problems can be achieved
through the use of a set of multiple independent MI-ANFIS networks and a one-versus-the-rest
training pattern. Formally, for a set of $N$ training bags, $B = \{B_p \mid p = 1, \ldots, N\}$,
and their corresponding labels $T = \{t_p \mid p = 1, \ldots, N,\ t_p \in [1 \ldots T]\}$, where $T$ is the number
of classes of a given multiple class classification problem, we define $T$ different MI-ANFIS
networks, denoted as $\{net_t\}_{t=1}^{T}$. To train each network, bags that belong to the positive
class are relabeled as positive, and all other bags as negative. During testing, a new unseen
bag is fed through the $T$ networks and $T$ outputs are computed. The bag is then assigned
the class with the highest output. While this extension is straightforward and works with
an arbitrary number of classes, it requires an extensive amount of preprocessing to relabel
the data and generate the networks. Moreover, doing so makes the data unbalanced, and
sampling should be used before training. In addition, some classes may share the same
concepts. Therefore, training different networks may lead to redundant rules being learned
and to wasted CPU cycles. Thus, it is better to restructure MI-ANFIS to support multiple
classes.

In the following, we describe the extended Multi-Class MI-ANFIS (MCMI-ANFIS), and
re-derive the necessary update equations. MCMI-ANFIS employs multiple instance fuzzy
rules and has a similar architecture to MI-ANFIS. Figure 5.9 is an illustration of the extended
architecture. Layer 1 through Layer 5 are the same as in MI-ANFIS. Layer 6 is a fully
connected layer. It is composed of $T$ nodes that compute the sum of all incoming signals as
follows:

$$O_{6,j} = \sum_{i=1}^{|O_3|} v_{ij}\, h_i\, \bar{w}_i\, S_\alpha(\mathbf{x}_{p1} \cdot \mathbf{b}^i, \mathbf{x}_{p2} \cdot \mathbf{b}^i, \ldots, \mathbf{x}_{pM_p} \cdot \mathbf{b}^i), \tag{5.48}$$

where $v_{ij}$ are weights as in standard feedforward neural networks. Layer 7 is an additional
layer that computes the log-probabilities of Layer 6’s outputs through the application of
the LogSoftmax function, which is given by

$$O_{7,j} = \log\left[\frac{\exp(O_{6,j})}{\sum_{k=1}^{T} \exp(O_{6,k})}\right]. \tag{5.49}$$

The reason behind applying LogSoftmax is to prepare the network’s outputs to be
used with a negative log likelihood criterion that is typically used to train classification
problems with multiple classes. Given this criterion, the loss function of MCMI-ANFIS, for
a given bag $B_p$ with class $t_p$ ($t_p$ is a class index in $[1 \ldots T]$), is defined by

$$E_p = -O_{7,t_p}. \tag{5.50}$$
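Equations (5.49) and (5.50) can be sketched in a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, the log-sum-exp shift is a standard stability device that leaves (5.49) unchanged, and the class index is zero-based in the code whereas the text indexes classes from 1.

```python
import math

def log_softmax(scores):
    """Layer 7, equation (5.49), computed with the log-sum-exp shift."""
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]

def nll_loss(scores, target_index):
    """Equation (5.50): negative log-probability of the true class.

    target_index is zero-based here; the text indexes classes from 1.
    """
    return -log_softmax(scores)[target_index]

layer6 = [2.0, 0.5, -1.0]            # hypothetical Layer 6 outputs, T = 3
log_probs = log_softmax(layer6)
print(log_probs)
print(nll_loss(layer6, 0))           # smallest loss: class 0 has the top score
print(nll_loss(layer6, 2))           # largest loss: class 2 has the lowest score
```

Exponentiating the Layer 7 outputs recovers class probabilities that sum to one, so the loss is zero only when the network assigns all probability mass to the true class.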


Figure 5.9: Multi-Class MI-ANFIS with R rules and T classes (outputs).


We now apply the chain rule (5.14) and derive the MCMI-ANFIS update equations.
First, we compute the gradients with respect to the premise parameters using (5.17) and
(5.18), we have

(5.51)

It is straightforward to show that

Thus, the update equations for the premise parameters are as follows

$\cdots \Big[1 + \alpha\big(r_{k,(i+(k-1)M_p)} - S_\alpha(\{r_{k,m}\}_{m=1}^{M_p})\big)\Big] \cdots$

(5.53)

(5.54)

Similarly, the update equation for the consequent parameters is given by

(5.55)

Finally, the gradients with respect to the fully connected layer weights, $v_{kt}$, are given
by

$\cdots \exp(O_{6,t}) \cdots$

(5.56)

Equations (5.53), (5.54), (5.55), and (5.56) can be used to train MCMI-ANFIS either
on-line (e.g., using stochastic gradient descent), or off-line in batch mode.


5.6  Complexity  Analysis 

We now study the asymptotic complexity of the proposed MI-ANFIS
algorithms in terms of four parameters: the number of training bags $N$, the average
number of instances per bag $M$, the dimensionality of instances $D$, and the number of
multiple instance rules $R$. MI-ANFIS performs two passes: (1) a forward pass to
compute the network output, as illustrated in Algorithm 5.1, and (2) a backward pass to
backpropagate the gradients and update the parameters, as illustrated in Algorithm 5.2.
First, for the forward pass, we perform the following sequential operations for each training
bag:

1.  Fuzzification  of  inputs:  M  x  D  x  R  operations. 

2.  Evaluation  of  truth  instances:  M  x  R  operations. 

3.  Computation  of  the  rules  activation  degrees:  R  operations. 

4.  Normalization  of  the  activation  degrees:  R  operations. 

5.  Evaluation  of  Layer  5  outputs:  M  x  D  x  R  operations. 

6.  Evaluation  of  the  overall  output:  1  operation. 

Thus,  the  total  number  of  operations  during  the  forward  pass  for  a  given  bag  is 
asymptotically  given  by  R  x  [2 M  x  D  +  M  +  2]  -(- 1  —  0(M  x  D  x  R).  For  MCMI-ANFIS 
networks  we  need  to  take  into  considerations  the  two  additional  layers  which  contribute  an 
additional  T  x  R  +  T  operations  (T  being  the  number  of  classes).  In  the  backward  pass,  for 
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each  training  bag  we  compute  the  gradients  with  respect  to  the  premise  and  consequents 
parameters.  The  number  of  operations  required  to  compute  each  gradient  is: 

1.  For MF centers (equation (5.17)): $D \times R \times (1 + 2MD + R)$.

2.  For MF standard deviations (equation (5.18)): $D \times R \times (1 + 2MD + R)$.

3.  For the consequent parameters (equation (5.42)): $D \times R \times [1 + M \times (M \times D)]$.

4.  For MCMI-ANFIS networks, there are $R \times T \times (M \times D + T)$ additional operations
needed to compute the gradients with respect to the fully connected layer weights.

Therefore, the backward step performs approximately
$3DR + 4MRD^2 + 2DR^2 + D^2M^2R \approx O(DR^2 + D^2M^2R)$ operations for MI-ANFIS networks, and
$3DR + 4MRD^2 + 2DR^2 + D^2M^2R + RDMT + RT^2 \approx O(DR^2 + D^2M^2R + RT^2)$ for MCMI-ANFIS
networks. The overall asymptotic running time for a given training dataset with $N$ bags is
dominated by the backward pass and is equal to $O(NDR^2 + ND^2M^2R)$ for MI-ANFIS, and
$O(NDR^2 + ND^2M^2R + NRT^2)$ for MCMI-ANFIS.

For problems with a large number of training bags, a relatively small number of rules,
low-dimensional features, and a constant number of instances, the big-O running time of the
network is linear in N, i.e., O(N).
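
As a rough check of the counts above, the per-bag totals can be tallied directly from the itemized steps. The sketch below is only an illustration; the function names are hypothetical and are not part of the MI-ANFIS implementation.

```python
def forward_ops(M, D, R):
    """Per-bag forward-pass count: sum of the six itemized steps."""
    fuzzification = M * D * R
    truth_instances = M * R
    activation = R
    normalization = R
    layer5 = M * D * R
    output = 1
    return (fuzzification + truth_instances + activation
            + normalization + layer5 + output)


def backward_ops(M, D, R):
    """Per-bag backward-pass count for MI-ANFIS (centers, sigmas, consequents)."""
    centers = D * R * (1 + 2 * M * D + R)
    sigmas = D * R * (1 + 2 * M * D + R)
    consequents = D * R * (1 + M * (M * D))
    return centers + sigmas + consequents


def mcmi_extra_backward_ops(M, D, R, T):
    """Additional per-bag count for the fully connected layer of MCMI-ANFIS."""
    return R * T * (M * D + T)
```

For fixed M, D, and R, each bag costs a constant number of operations, so the total over N bags grows linearly in N.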

5.7  Discussion 

MI-ANFIS deals with ambiguity by introducing the novel concept of truth instances:
when reasoning with a bag of instances at Layer 2 (Figure 5.1), a proposition does
not have a single degree of truth; instead, it has multiple degrees of truth, which we call truth
instances. This effectively encodes the third vagueness component of ambiguity and
increases the expressive power of traditional fuzzy logic.

Learning positive concepts from ambiguously labeled data has been the core task of various
MIL algorithms (e.g., Diverse Density [39]). MI-ANFIS has shown that it can learn positive
concepts effectively while jointly providing a fuzzy representation of such regions. The fuzzy
representation is combined into meaningful and simple multiple instance rules that can be
easily visualized and interpreted.

Compared to previously proposed multiple instance neural networks, such as Multiple
Instance Neural Networks [63] (MI-NN) and Multiple Instance RBF Neural Networks [95]
(RBF-MIP), the advantage of MI-ANFIS is its use of multiple instance fuzzy logic to learn a
fuzzy representation of true positive concepts. MI-NN only learns standard neural network
weights that do not carry any information regarding concepts. On the other hand,
standard RBF neural networks have been shown to be equivalent to zero-order traditional
Sugeno systems under certain constraints [96] and are thus capable of learning a fuzzy
representation of the inputs. RBF-MIP networks, however, have a different architecture and do not
employ adaptive radial basis functions in the first layer. Instead, they represent the inputs by
computing their distances to clusters of training bags. This latter method is expensive, and its
success greatly depends on the quality of the training data, as it takes into consideration all
the training examples, which may include wrongly (noisily) labeled bags. It does not lead to
learning true positive concepts, only to learning other discriminative regions of the bag space.
Moreover, MI-ANFIS learning algorithms can be updated to support a wide range of loss
functions (criteria) such as cross entropy [97], maximum margin [98], etc. MI-NN is
designed to use a handcrafted loss function (see Section 2.1.4) which is largely responsible for
the multiple instance behavior of the system and cannot be changed without substantially
changing the architecture of MI-NN. This could be disadvantageous if MI-NN is to be used
to solve multiple instance, multiple class classification problems.

Finally, when compared to our proposed MI-Mamdani system, MI-ANFIS is fully
independent. MI-Mamdani requires positive concepts to be learned using a different algorithm
(e.g., FCMI) or specified based on intuition. MI-ANFIS does not rely on any traditional MIL
algorithms and can learn its rule base from data.
CHAPTER 6


EXPERIMENTAL  RESULTS 

In this chapter, we provide a quantitative evaluation of the proposed framework by
applying it to benchmark datasets commonly used to evaluate MIL methods. First, we apply
MI-MAMDANI and MI-ANFIS to the MUSK [37], Fox, Tiger, and Elephant datasets [94].
Then, we apply our multiple class MI-ANFIS (MCMI-ANFIS) to solve a 20-class classification
problem derived from the COREL dataset [94]. The datasets are described as follows.

6.1  Benchmark  Datasets 

•  The  MUSK  Dataset: 

The  MUSK  dataset  is  the  most  commonly  used  data  in  the  context  of  MIL.  This  MIL 
problem  is  a  case  of  polymorphism  ambiguity.  The  goal  is  to  classify  molecules  by 
looking  at  their  shapes.  Each  molecule  can  appear  in  several  distinct  shapes  because 
of  binding  and  twisting  that  might  occur  during  interactions.  Thus,  a  molecule  can 
have  different  forms  of  expression.  The  objective  is  to  classify  whether  a  molecule 
smells  musky  [99].  To  solve  this  problem  using  standard  single  instance  learning,  we 
first  need  to  identify  which  form  is  responsible  for  the  molecule  behaviour.  However, 
this  process  is  tedious.  Hence,  the  problem  is  better  represented  as  a  multiple  instance 
problem.  Two  versions  of  the  dataset  were  released,  MUSK1  and  MUSK2.  In  both 
datasets,  each  bag  represents  a  molecule  and  instances  within  each  bag  represent  the 
different  low-energy  conformations  of  the  molecule.  Each  instance  is  characterized  by 
166  features.  MUSK1  has  92  bags,  of  which  47  are  positive,  and  MUSK2  has  102 
bags,  of  which  39  are  positive. 

•  Fox,  Tiger,  and  Elephant  Datasets: 
These datasets classify whether an image contains the corresponding animal. Each
dataset  consists  of  200  images  (bags):  100  positive  images  containing  the  target  animal 
and  100  negative  images  containing  other  animals.  Each  image  is  represented  as  a  set 
of  patches  (instances)  and  each  patch  is  in  turn  represented  by  230  features  describing 
color,  texture,  and  shape  information  [94]. 

•  The  COREL  Dataset: 

To evaluate our proposed multi-class MCMI-ANFIS algorithm, we use it to categorize
images from the COREL image dataset. In particular, we use the Corel-1000 and
Corel-2000 datasets. Corel-1000 has 1000 images that cover 10 categories, and Corel-2000 has
2000 images that cover 20 categories. Each category has 100 images, and each
image is represented by a bag of instances obtained by extracting features
from segmented regions of the image. Each instance is a 9-D feature vector
characterizing the color, texture, and shape properties of a segmented region. For the sake
of fair comparison, we adopted the same data settings and image segmentation
algorithm used in previous state-of-the-art work [43]. Figure 6.1 shows images randomly
sampled from the 20 categories and the corresponding segmentation results.

Table 6.1 summarizes the characteristics of the MUSK, Fox, Tiger, and Elephant
datasets. Table 6.2 describes the categories of the COREL datasets and the corresponding
number of instances. We should note that for each dataset, the bags have a variable number
of instances. For example, for the MUSK1 data, the number of instances per bag varies from
2 to 40. To reduce the dimensionality of the features in order to speed up MI-FIS training
and increase the interpretability of the generated multiple instance fuzzy rules, we apply
principal component analysis (PCA). In Table 6.1, we show both the original and reduced
dimensions for each dataset.
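
The dimensionality reduction step can be sketched as follows. This is a minimal illustration that assumes the projection is fitted on all instances pooled across bags and then applied bag by bag; the text does not state how the PCA was fitted, and the helper names `fit_pca` and `transform_bags` are hypothetical.

```python
import numpy as np


def fit_pca(bags, n_components):
    """Fit PCA on the instances of all bags pooled together (an assumption)."""
    X = np.vstack(bags)                      # (total instances, D)
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def transform_bags(bags, mean, components):
    """Project every instance of every bag; the bag structure is preserved."""
    return [(bag - mean) @ components.T for bag in bags]
```

With MUSK-like inputs (166 features per instance), calling `fit_pca(bags, 25)` followed by `transform_bags` yields bags with the same number of instances and 25 features each.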
Figure 6.1: Images randomly sampled from 20 categories and the corresponding
segmentation results. Segmented regions are shown in their representative colors (source [43]).


TABLE 6.1

MUSK, Fox, Tiger, and Elephant Datasets

Dataset    dim. (PCA)   No. Bags   Positive   Negative   No. Instances (min → max)   Avg     Median
MUSK1      166 (25)     92         47         45         2 → 40                      5.17    4
MUSK2      166 (25)     102        39         63         1 → 1044                    64.69   12
Fox        230 (10)     200        100        100        2 → 13                      6.47    6
Tiger      230 (10)     200        100        100        1 → 13                      5.44    6
Elephant   230 (10)     200        100        100        2 → 13                      7.62    7

6.2  Evaluation  of  MI-MAMDANI  and  MI-ANFIS  algorithms 


For all experiments, to learn an MI-Mamdani system from the training data, we first
apply FCMI to extract concept points, as outlined in Section 4.3. Next, we generate
multiple instance fuzzy rules from the concept points. Finally, the learned rules are combined
into an MI-Mamdani multiple instance fuzzy inference system. We also construct zero-order
MI-ANFIS systems with various numbers of multiple instance rules. We use Gaussian MFs
TABLE 6.2

20 Image Categories of the COREL dataset and the Corresponding Average Number of
Instances (regions)

Category ID   Category Name                  Instances per Image
1             African people and villages    4.84
2             Beach                          3.54
3             Historical building            3.1
4             Buses                          7.59
5             Dinosaurs                      2.00
6             Elephants                      3.02
7             Flowers                        4.46
8             Horses                         3.89
9             Mountains and glaciers         3.38
10            Food                           7.24
11            Dogs                           3.80
12            Lizards                        2.80
13            Fashion models                 5.19
14            Sunset scenes                  3.52
15            Cars                           4.93
16            Waterfalls                     2.56
17            Antique furniture              2.30
18            Battle ships                   4.32
19            Skiing                         3.34
20            Desserts                       3.65

to describe the input fuzzy sets. For initialization, we use the FCM [60] algorithm to cluster
the instances of the positive bags into a prefixed number of clusters, and we initialize the MFs'
centers as the cluster centers. Table 6.3 summarizes all parameters used in training
MI-ANFIS. We note that the reason for using large standard deviations for the MUSK1 and
MUSK2 datasets is to allow the initial rules to cover the entirety of the input space.
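
The initialization described above can be sketched as follows. This is an illustrative, textbook fuzzy c-means loop, not the code used in the experiments. The fuzzifier `m = 2`, the fixed iteration count, and the function names are assumptions, and every membership function receives the same initial standard deviation.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, seed=0):
    """Basic FCM: alternate center updates and membership updates."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)             # memberships sum to 1 per point
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)               # guard against division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U


def init_gaussian_mfs(positive_bags, n_rules, sigma):
    """One Gaussian MF per (rule, input): centers from FCM, constant sigma."""
    X = np.vstack(positive_bags)                  # pool instances of positive bags
    centers, _ = fuzzy_c_means(X, n_rules)
    sigmas = np.full_like(centers, sigma)
    return centers, sigmas
```

For example, `init_gaussian_mfs(positive_bags, n_rules=6, sigma=100.0)` mirrors the MUSK1 column of Table 6.3.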

To illustrate the advantage of using MI-ANFIS over the traditional ANFIS, we
compare these two algorithms on the first five datasets. Since ANFIS cannot learn from
ambiguously labeled data, for the sake of comparison, we consider the naive MIL assumption where
TABLE 6.3

MI-ANFIS Training Parameters

Parameter           MUSK1   MUSK2   Fox    Tiger   Elephant
No. of MI Rules     6       3       2      4       3
No. of Inputs       25      25      10     10      10
MF's σ              100     100     10     10      10
Output parameters   1s      1s      1s     1s      1s
Softmax's α         1       1       1      1       1
Learning rate       0.1     0.1     0.1    0.1     0.1

TABLE 6.4

Comparison of MI-ANFIS prediction accuracy (in percent) to Naive-ANFIS on the
benchmark data sets (averaged over five runs)

Algorithms     MUSK1          MUSK2          Fox            Tiger          Elephant
MI-ANFIS       93.49 ±0.76    90.58 ±1.31    66.4 ±2.77     84.5 ±0.61     86.97 ±1.10
Naive-ANFIS    67.82 ±4.04    79.43 ±5.04    58.70 ±1.35    77.70 ±0.83    82.2 ±0.83

all instances from positive bags are considered positive and all instances from negative bags
are considered negative. We refer to this implementation as Naive-ANFIS. The results are
summarized in Table 6.4, where the performance is reported in terms of prediction accuracy
averaged over all 10 cross-validation sets (% correct ± standard deviation). As can
be seen, MI-ANFIS outperforms Naive-ANFIS significantly. This is because inaccurately
labeled instances within the positive bags were used for training Naive-ANFIS.
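
The naive MIL assumption amounts to copying each bag label to every instance in the bag, which turns the multiple instance problem into a standard supervised one. A minimal sketch, with a hypothetical helper name:

```python
import numpy as np


def naive_instance_labels(bags, bag_labels):
    """Flatten bags into single instances, each inheriting its bag's label."""
    X = np.vstack(bags)
    y = np.concatenate(
        [np.full(len(bag), label) for bag, label in zip(bags, bag_labels)]
    )
    return X, y
```

Negative instances inside positive bags receive a positive label under this scheme, which is the source of the label noise discussed above.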

Table 6.5 shows the performance of the proposed algorithms as compared to
state-of-the-art MIL algorithms on the first five benchmark datasets. MI-MAMDANI and MI-ANFIS
were trained and tested using ten-fold cross-validation. Table 6.6 summarizes the average
running time of cross-validation of MI-ANFIS as compared to other algorithms on the
benchmark datasets.

Overall,  MI-ANFIS  is  comparable  to  other  MIL  algorithms.  In  fact,  on  all  tested 
TABLE 6.5

Comparison of MI-ANFIS prediction accuracy (in percent) to other methods on the
benchmark data sets. Results for 3 top performing methods are shown in bold font. We use
reported results; N/A indicates that a given algorithm was not applied to that dataset.

Algorithms           MUSK1          MUSK2          Fox            Tiger          Elephant
MI-ANFIS             93.49 ±0.76    90.58 ±1.31    66.4 ±2.77     84.5 ±0.61     86.97 ±1.10
MILES [100]          86.3           87.7           N/A            N/A            N/A
APR [37]             92.4           89.2           N/A            N/A            N/A
DD [39]              88.9           82.5           N/A            N/A            N/A
DD-SVM [101]         85.8           91.3           N/A            N/A            N/A
EM-DD [58]           84.8           84.9           56.1           72.1           78.3
Citation-KNN [67]    92.4           86.3           N/A            N/A            N/A
MI-SVM [94]          77.9           84.3           57.8           84.0           81.4
mi-SVM [94]          87.4           83.6           58.2           78.4           82.2
MI-NN [102]          88.0           82.0           N/A            N/A            N/A
Bagging-APR [103]    92.8           93.1           N/A            N/A            N/A
RBF-MIP [95]         91.3 ±1.6      90.1 ±1.7      N/A            N/A            N/A
BP-MIP [63]          83.7           80.4           N/A            N/A            N/A
RBF-Bag-Unit [104]   90.3           86.6           N/A            N/A            N/A
MI-Kernel [105]      88.0           89.3           60.3           84.2           84.3
PPPM-kernel [106]    95.6           81.2           60.3           80.2           82.4
MIGraph [105]        90.0           90.0           61.2           81.9           85.1
miGraph [105]        88.9           90.3           61.6           86.0           86.8
ALP-SVM [107]        86.3           86.2           66.0           86.0           83.5
MIForest [108]       85.0           82.0           64.0           82.0           84.0
MI-MAMDANI           88.33 ±1.67    74.0 ±3.2      65.4 ±1.1      79.9 ±1.6      79.5 ±1.5

datasets, MI-ANFIS ranked consistently among the top three. For MUSK1, PPPM-kernel
[106] performed the best (95.6%), but this algorithm did not perform as well on the other
datasets. For MUSK2, Bagging-APR [103] achieved the best accuracy, as reported by [100].
MI-ANFIS achieved the best average performance for the Fox and Elephant datasets, and the
second best performance after the miGraph [105] and ALP-SVM [107] methods for the Tiger
TABLE 6.6

Comparison of MI-ANFIS running time (in minutes) to other methods on the benchmark
data sets.

Algorithms           MUSK1    MUSK2    Fox     Tiger    Elephant
MI-ANFIS             1.1      8        6       5.5      0.5
MILES [100]          29.1     130.2    N/A     N/A      N/A
DD [39]              2.85     32       N/A     N/A      N/A
DD-SVM [101]         612      1740     N/A     N/A      N/A
EM-DD [58]           3.75     15.5     3.3     14.36    5
Citation-KNN [67]    0.01     2.57     N/A     N/A      N/A
MI-SVM [94]          0.5      5.3      0.28    0.21     2.43

dataset. On the other hand, MI-MAMDANI performed better than 10 of the 19 algorithms
tested on MUSK1, and better than 7 of the 9 algorithms
tested on Fox. However, MI-MAMDANI did not exhibit consistent performance on the rest
of the benchmark datasets. MI-MAMDANI systems are constructed by transforming
concept points extracted using FCMI (or other MIL methods) into multiple instance fuzzy
rules. In scenarios where bags have a large number of instances (such as MUSK2), this
handcrafted method does lead to an accurate fuzzy representation of concepts, but further
fine-tuning should be used to improve the consequent parts of the generated rules.

6.3  MCMI-ANFIS 

Using the COREL dataset, we train an MCMI-ANFIS to solve the problem of
region-based image categorization. We adopted the same training and testing settings as other
state-of-the-art algorithms: images within each class were randomly split equally into a
training set and a testing set. In the following, we report average results of five runs.
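
The per-class random split can be sketched as below. This is an illustrative helper only; its name and the seeding scheme are assumptions, and for repeated runs one would vary the seed.

```python
import numpy as np


def split_half_per_class(labels, seed=0):
    """Randomly split the indices of each class equally into train and test."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train, test = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        half = len(idx) // 2
        train.extend(idx[:half])
        test.extend(idx[half:])
    return np.sort(train), np.sort(test)
```

With 100 images per category, each category contributes 50 training and 50 testing images.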

For both the Corel-1000 and Corel-2000 experiments, we construct a first-order
MCMI-ANFIS with 60 multiple instance fuzzy rules, employing Rule Dropout. Table 6.7
summarizes the MCMI-ANFIS properties. The system is then trained for 2000 epochs. Table
6.8 and Table 6.9 report the confusion matrices of the two experiments.
TABLE 6.7

MCMI-ANFIS Training Parameters

Parameter            Value
No. of MI Rules      60
No. of Inputs        9
MF's σ               10
Rule Dropout Rate    0.2
Softmax's α          1
Learning rate        0.1


Analysis of the confusion matrix of the Corel-1000 experiment shows that the largest
classification error occurred between category 2 (Beach) and category 9 (Mountains and glaciers):
18.4% of Mountains and glaciers images were classified as Beach and 16.7% of Beach images
were confused with Mountains and glaciers. The African people and villages category exhibited
the lowest performance, 65.9%. These observations are in line with previous work [43]. The
large classification errors are due to the semantic richness of these categories, as they contain
multiple concepts that are similar to other categories. Analysis of the confusion matrix of the
Corel-2000 experiment reveals confusions similar to those of Corel-1000. In addition, 10% of the
Desserts category images were confused with Beach and 20.9% of Mountains and glaciers
images were misclassified as Waterfalls. Even though these categories are visually similar,
the classification accuracy can be improved through the use of more distinctive features.
However, for fairness of comparison, we used the same feature set as previous work.
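
The kind of inspection described above, finding the largest off-diagonal entries of a row-normalized confusion matrix, can be sketched as follows. The helper is hypothetical and uses 0-based row and column indices, so category 9 corresponds to index 8.

```python
import numpy as np


def largest_confusions(cm, k=3):
    """Return the k largest off-diagonal entries as (true, predicted, value)."""
    off = np.array(cm, dtype=float)          # copy so the input is not modified
    np.fill_diagonal(off, -np.inf)           # ignore correct classifications
    flat_order = np.argsort(off, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat_order, off.shape)
    return [(int(r), int(c), float(off[r, c])) for r, c in zip(rows, cols)]
```

Applied to a confusion matrix whose rows are true categories, the first tuple returned identifies the most frequent misclassification.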
TABLE 6.8

Confusion matrix of MCMI-ANFIS on the region-based image categorization experiments
using Corel-1000 Dataset (showing the run with the best overall accuracy, 83.8%). Each
row shows the percentage of images in one category classified to each of the 10 categories.

Cat.      1      2      3      4      5      6      7      8      9     10
1      65.9    4.9    4.9    0.0    2.4   12.2    2.4    2.4    0.0    4.9
2       4.2   66.7    0.0    4.2    2.1    2.1    0.0    2.1   16.7    2.1
3       5.2   10.3   81.0    0.0    0.0    1.7    0.0    0.0    1.7    0.0
4       0.0    3.6    3.6   89.1    0.0    0.0    0.0    0.0    3.6    0.0
5       0.0    0.0    0.0    0.0   92.9    3.6    0.0    0.0    3.6    0.0
6       0.0    0.0    2.3    0.0    0.0   86.4    0.0    0.0   11.4    0.0
7       2.2    0.0    0.0    0.0    0.0    0.0   97.8    0.0    0.0    0.0
8       3.6    0.0    0.0    0.0    0.0    7.3    0.0   85.5    0.0    3.6
9       2.0   18.4    0.0    0.0    0.0    2.0    0.0    0.0   77.6    0.0
10      2.0    2.0    2.0    0.0    0.0    0.0    0.0    2.0    0.0   91.8
TABLE 6.9

Confusion matrix of MCMI-ANFIS on the region-based image categorization experiments using Corel-2000 Dataset (showing the run
with the best overall accuracy, 70.1%). Each row shows the percentage of images in one category classified to each of the 20 categories.

Cat.      1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16     17     18     19     20
1      59.6    0.0    4.3    0.0    0.0    2.1    4.3    0.0    0.0    0.0    6.4    2.1    2.1    4.3    4.3    2.1    0.0    0.0    0.0    8.5
2       2.0   50.0    2.0    0.0    0.0    0.0    0.0    0.0   24.0    2.0    2.0    0.0    0.0    2.0    2.0    0.0    0.0    4.0    4.0    6.0
3       4.2    4.2   70.8    0.0    0.0    2.1    0.0    0.0    2.1    0.0    2.1    2.1    0.0    0.0    0.0    0.0    2.1    4.2    2.1    4.2
4       0.0    3.4    5.1   83.1    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    1.7    0.0    0.0    3.4    3.4    0.0
5       0.0    0.0    0.0    0.0   86.8    1.9    0.0    0.0    0.0    0.0    0.0    0.0    1.9    0.0    0.0    0.0    3.8    0.0    0.0    5.7
6       5.7    3.8    0.0    0.0    1.9   64.2    0.0    0.0    7.5    0.0    1.9    0.0    0.0    0.0    0.0    5.7    1.9    0.0    3.8    3.8
7       0.0    0.0    0.0    0.0    0.0    0.0   88.1    0.0    0.0    0.0    2.4    0.0    0.0    4.8    2.4    0.0    0.0    0.0    2.4    0.0
8       1.7    0.0    0.0    0.0    0.0    1.7    0.0   81.4    0.0    0.0   10.2    5.1    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
9       0.0    4.7    0.0    0.0    0.0    7.0    0.0    0.0   46.5    0.0    2.3    0.0    0.0    0.0    2.3   20.9    0.0    2.3    9.3    4.7
10      3.8    1.9    0.0    1.9    0.0    0.0    1.9    0.0    0.0   77.4    0.0    3.8    1.9    1.9    3.8    0.0    0.0    1.9    0.0    0.0
11      6.8    0.0    2.3    0.0    0.0    0.0    2.3    4.5    0.0    2.3   63.6    4.5    4.5    2.3    0.0    0.0    0.0    0.0    4.5    2.3
12      3.9    2.0    7.8    0.0    0.0    0.0    0.0    2.0    0.0    0.0    2.0   72.5    0.0    0.0    2.0    0.0    0.0    0.0    3.9    3.9
13      0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    2.1    0.0    8.5    0.0   76.6    0.0    0.0    0.0    2.1    0.0    4.3    6.4
14      0.0    0.0    0.0    0.0    0.0    0.0    8.9    0.0    0.0    1.8    0.0    0.0    0.0   66.1   16.1    0.0    0.0    0.0    1.8    5.4
15      0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    3.3    0.0   10.0    1.7   10.0   61.7    0.0    5.0    3.3    0.0    5.0
16      2.0    2.0    2.0    0.0    0.0    7.8    0.0    0.0   13.7    0.0    0.0    2.0    0.0    0.0    0.0   70.6    0.0    0.0    0.0    0.0
17      3.8    0.0    1.9    0.0    0.0    1.9    0.0    0.0    0.0    0.0    1.9    0.0    1.9    0.0    0.0    0.0   83.0    0.0    1.9    3.8
18      0.0    4.1    2.0    2.0    0.0    0.0    0.0    0.0    6.1    0.0    0.0    0.0    0.0    2.0    0.0    0.0    0.0   77.6    2.0    4.1
19      0.0    9.6    0.0    0.0    5.8    0.0    0.0    0.0    9.6    0.0    1.9    0.0    1.9    0.0    0.0    1.9   13.5    1.9   53.8    0.0
20      0.0   10.0    0.0    0.0    0.0    3.3    0.0    0.0    6.7    3.3    3.3    0.0    0.0    0.0    0.0    3.3    0.0    6.7    3.3   60.0

TABLE 6.10

Comparison of MCMI-ANFIS classification accuracy (in percent) to other methods on the
Corel-1000 and Corel-2000 benchmark datasets

Algorithms          Corel-1000    Corel-2000
MCMI-ANFIS          82.1 ±1.5     69.7 ±0.4
MIGraph [105]       83.9          72.1
miGraph [105]       82.4          70.5
MI-Kernel [105]     81.8          72.0
MILES [43]          82.6          68.7
MI-SVM [94]         74.7          54.6
DD-SVM [101]        81.5          67.5
Kmeans-SVM [43]     69.8          52.3


Table 6.10 reports the classification accuracy averaged over five runs (% correct
± standard deviation). Overall, MCMI-ANFIS showed consistent performance on both
datasets and achieved competitive results compared to other MIL methods reported in
the literature. When compared to the top performing method, MIGraph [105],
MCMI-ANFIS showed comparable results. In addition, MIGraph, and most other methods, were
trained and tested using a one-versus-all training pattern, whereas MCMI-ANFIS learned
all the concepts in one training pass, which is usually a more difficult task. Also, the
performance of MCMI-ANFIS was better than that of MILES [43], which was considered the
state-of-the-art algorithm on the Corel dataset until MIGraph and MI-Kernel were published. It is worth
noting that on the binary classification problems of the previous section, MI-ANFIS was
better than MIGraph on 4 out of the 5 datasets. In general, MI-ANFIS and MCMI-ANFIS
showed competitive and consistent results on all benchmark datasets.

We note that Rule Dropout was necessary to train MCMI-ANFIS. Without Rule Dropout,
we observed overfitting, which led to a low accuracy of 44% on the Corel-2000 dataset. This
emphasizes the importance of regularization and the need for large datasets to train neural
networks in general. For the Corel experiments, our MCMI-ANFIS has 2880 parameters
(9 x 2 x 60 premise parameters, 10 x 60 consequent parameters, and 60 x 20 fully connected
layer parameters) to be learned, versus only 2000 training bags, making overfitting more
likely to occur. Rule Dropout helped reduce this effect significantly, leading to competitive
performance.
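
The parameter count quoted above can be reproduced with a few lines of arithmetic. The sketch assumes that each rule has one Gaussian MF (center and standard deviation) per input and that the factor of 10 in the consequent count is the 9 inputs plus one bias term of a first-order consequent. The function name is hypothetical.

```python
def mcmi_anfis_param_count(n_inputs, n_rules, n_classes):
    """Return (premise, consequent, fully connected) parameter counts."""
    premise = n_inputs * 2 * n_rules           # center and sigma per input, per rule
    consequent = (n_inputs + 1) * n_rules      # first-order: weights plus bias
    fully_connected = n_rules * n_classes
    return premise, consequent, fully_connected
```

For 9 inputs, 60 rules, and 20 classes, this gives 1080 + 600 + 1200 = 2880 parameters.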
CHAPTER 7


APPLICATION: LANDMINE DETECTION USING GROUND PENETRATING RADAR

In  this  Chapter,  we  apply  the  proposed  multiple  instance  fuzzy  inference  framework 
to  fuse  multiple  landmine  detection  algorithms.  First,  we  start  with  an  overview  of  the 
landmine  detection  problem  and  illustrate  the  need  to  solve  this  problem  using  multiple 
instance  learning.  Then,  we  describe  the  dataset  used  in  the  experiments.  Next,  we  show 
how  the  fusion  problem  can  be  solved  using  traditional  Mamdani  and  ANFIS  inference. 
Finally,  we  develop  fusion  methods  using  our  multiple  instance  fuzzy  inference  systems  and 
report  the  results. 

7.1  Landmine  Detection 

Detection  and  removal  of  landmines  is  a  serious  problem  affecting  human  beings 
worldwide.  The  world  is  now  littered  with  an  estimated  200-215  million  landmines  in 
91  countries,  which  maim  or  kill  an  estimated  500  people  every  week,  mostly  innocent 
civilians  [109].  The  task  of  detection  of  buried  landmines  is  of  extreme  difficulty  and  this 
is  mainly  due  to  the  large  variety  of  landmine  types,  different  soil  type  and  compaction, 
temperature,  moisture,  shadow,  time  of  day,  weather  conditions,  and  varying  terrain,  to 
name  a  few. 

A variety of sensors have been proposed or are under investigation for landmine
detection. The research problem for sensor data analysis is to determine how well signatures
of landmines can be characterized and distinguished from other objects under the ground
using returns from one or more sensors. Recently, various discrimination algorithms
[110-114] have been proposed for detecting buried objects using ground-penetrating radar (GPR)
[115, 116]. GPR offers the promise of detecting landmines with little or no metal content.
The sensor works by emitting an electromagnetic wave covering a large frequency band into
the ground through a wide-band antenna. Reflections from the soil caused by dielectric
variations, such as the presence of an object, are measured. By moving the antenna, it is
possible to reconstruct an image representing a vertical slice of the soil. The data generated
are 3-dimensional and correspond to depth, down-track, and cross-track (Figure 7.1). Most
discrimination algorithms process only 2-D slices of the 3-D cube: (down-track, depth)
or (cross-track, depth). The performance of the down-track and cross-track discrimination
algorithms can vary significantly depending on the target shape, burial orientation, and
other environmental conditions. In some cases, these algorithms can provide complementary
evidence, while in other cases they provide contradicting evidence. Thus, effective fusion of
these algorithms can achieve a higher probability of detection with fewer false alarms.

Figure 7.1: 3-dimensional and 2-dimensional raw GPR data.

To train discrimination algorithms, we use data collected with known target
locations. However, only the (down-track, cross-track) position can be extracted. The depth
position is usually unknown as it depends on the burial depth, height of the target, type of
soil, height of the GPR antenna above the ground, etc. Thus, there is uncertainty in the depth
estimation of the targets that can affect both the training and testing phases of a fusion
system. For training, it is very difficult to localize the object's depth automatically, and it
is a very tedious process to do it manually. Similarly, during testing, it is not trivial
to combine partial confidence values from multiple depths. Therefore, the MIL paradigm is
well suited to this problem.

Several landmine discriminators could be used in the fusion system. In this
dissertation, we validate our approach using four discrimination algorithms. Two of the algorithms
are based on the Edge Histogram Descriptor (EHD) [117]. The first algorithm processes the
2-D (down-track, depth) slice of the 3-D GPR signal to generate partial confidence values
at different depths, and is referred to as EHDDT (DT indicates down-track). Similarly, the
second algorithm processes the 2-D (cross-track, depth) slice and is referred to as EHDCT
(CT indicates cross-track). The other two discrimination algorithms are based on the Fisher
Vector features [118]. As with EHD, one of the algorithms, called FisherVectorDT,
extracts features from the (down-track, depth) view, and the second algorithm, called
FisherVectorCT, extracts features from the (cross-track, depth) view.

In  the  following,  we  briefly  describe  the  GPR  data  and  present  the  discrimination 
algorithms.  More  details  can  be  found  in  [117, 119].  We  also  outline  the  extraction  of  two 
additional  features  that  are  used  to  refine  the  fusion  rules  when  necessary. 

7.1.1  GPR  data 

The data used in our multi-algorithm fusion system was collected using a
vehicle-mounted GPR (as shown in Figure 7.2). As the vehicle travels, it generates a 3-dimensional
matrix of sample values (shown in Figure 7.1a) that correspond to depth, down-track, and
cross-track, $S(z, x, y)$, $z = 1, \ldots, N_D$; $x = 1, \ldots, N_C$; $y = 1, \ldots, N_S$, where $z$, $x$, and $y$ represent
depth, cross-track, and down-track positions respectively, and $N_D$, $N_C$, and $N_S$ represent
the collected sample size along the depth, cross-track, and down-track dimensions.
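
To make the indexing concrete, the two kinds of 2-D slices processed by the discrimination algorithms can be taken from the 3-D volume as follows. This is an illustrative sketch that assumes the volume is stored as a NumPy array ordered (depth, cross-track, down-track) with 0-based indices; the function names are hypothetical.

```python
import numpy as np


def down_track_slice(S, x):
    """(depth, down-track) image at a fixed cross-track position x."""
    return S[:, x, :]


def cross_track_slice(S, y):
    """(depth, cross-track) image at a fixed down-track position y."""
    return S[:, :, y]
```

For a volume of shape `(N_D, N_C, N_S)`, the first function returns an `(N_D, N_S)` image and the second an `(N_D, N_C)` image.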




Figure  7.2:  Vehicle  mounted  GPR  system. 


7.1.2  EHDDT  and  EHDCT  algorithms 

The EHDDT is the same as the standard Edge Histogram Descriptor (EHD)
algorithm proposed by Frigui et al. [117]. The EHD uses translation-invariant features that
are based on the histogram of edges in the GPR signatures, and a possibilistic k-Nearest
Neighbors (k-NN) rule for confidence assignment [120]. The EHD is an adaptation of the
MPEG-7 EHD feature [121], which captures the signature's texture as a feature for
recognition. It has been adapted to capture the spatial distribution of the edges within a 3-D
GPR data volume. To keep the computation simple, 2-D edge operators are used, and two
types of edge histograms are computed. The first one is obtained by fixing the cross-track
dimension and extracting edges in the (depth, down-track) plane. The second edge
histogram is obtained by fixing the down-track dimension and extracting edges in the (depth,
cross-track) plane.

Let $S_{zy}^{(x)}$ be the $x$th plane of the 3-D signature $S(z, x, y)$. First, for each
$S_{zy}^{(x)}$, four categories of edges are computed: vertical, horizontal, 45° diagonal,
and 135° anti-diagonal. If the maximum of the edge strengths exceeds a preset threshold,
the corresponding pixel is considered to be an edge pixel. Otherwise, it is considered a
non-edge pixel. Next, each $S_{zy}^{(x)}$ image is vertically subdivided into 7 overlapping
sub-images $S_{zy_i}^{(x)}$, $i = 1, \ldots, 7$. For each $S_{zy_i}^{(x)}$, a 5-bin edge
histogram, $H_{zy_i}^{(x)}$, is computed. The bins correspond to the 4 edge categories and
the non-edge pixels. The EHD is defined as the concatenation of the 7 five-bin histograms.
That is,

$$\mathrm{EHD}(S_{zy}) = [\bar{H}_{zy_1}\ \bar{H}_{zy_2}\ \bar{H}_{zy_3}\ \ldots\ \bar{H}_{zy_7}], \tag{7.1}$$

where $\bar{H}_{zy_i}$ is the cross-track average of the edge histograms of sub-image
$S_{zy_i}^{(x)}$ over $N_C$ channels, i.e.,

$$\bar{H}_{zy_i} = \frac{1}{N_C} \sum_{x=1}^{N_C} H_{zy_i}^{(x)}. \tag{7.2}$$



The EHDCT is a variation of the standard EHD and follows the same feature extraction
process described above. However, while the EHDDT is mainly based on the (depth,
down-track) slices, the EHDCT focuses only on the (depth, cross-track) slices.
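
The EHD construction described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the 2x2 edge masks (taken in the style of the MPEG-7 EHD), the threshold value, the roughly 50% overlap between sub-images, the subdivision along the depth axis, and the normalization of each histogram are all assumptions, since the text does not specify them. The function names are hypothetical.

```python
import numpy as np

# 2x2 edge masks in the style of the MPEG-7 EHD (assumed; the text only names the categories).
MASKS = np.array([
    [[1.0, -1.0], [1.0, -1.0]],                # vertical
    [[1.0, 1.0], [-1.0, -1.0]],                # horizontal
    [[np.sqrt(2), 0.0], [0.0, -np.sqrt(2)]],   # 45-degree diagonal
    [[0.0, np.sqrt(2)], [-np.sqrt(2), 0.0]],   # 135-degree anti-diagonal
])


def edge_labels(plane, threshold):
    """Label each 2x2 neighborhood 0..3 by its strongest edge type, or 4 for non-edge."""
    rows, cols = plane.shape
    strengths = np.zeros((4, rows - 1, cols - 1))
    for e, mask in enumerate(MASKS):
        for dr in range(2):
            for dc in range(2):
                strengths[e] += mask[dr, dc] * plane[dr:rows - 1 + dr, dc:cols - 1 + dc]
    strengths = np.abs(strengths)
    labels = strengths.argmax(axis=0)
    labels[strengths.max(axis=0) <= threshold] = 4
    return labels


def plane_histograms(plane, threshold, n_sub=7):
    """Return an (n_sub, 5) array of normalized edge histograms for one 2-D plane."""
    labels = edge_labels(plane, threshold)
    n_rows = labels.shape[0]
    # Assumed layout: equal-height sub-images with about 50% overlap along the first axis.
    height = max(1, (2 * n_rows) // (n_sub + 1))
    starts = np.linspace(0, n_rows - height, n_sub).round().astype(int)
    hists = np.empty((n_sub, 5))
    for i, s in enumerate(starts):
        counts = np.bincount(labels[s:s + height].ravel(), minlength=5)
        hists[i] = counts / counts.sum()
    return hists


def ehd(volume, threshold=0.5):
    """volume[z, x, y]: average the per-plane histograms over x, then concatenate them."""
    per_plane = [plane_histograms(volume[:, x, :], threshold)
                 for x in range(volume.shape[1])]
    return np.mean(per_plane, axis=0).ravel()   # length 7 * 5 = 35
```

Under these assumptions, swapping the roles of the cross-track and down-track axes of `volume` would give the cross-track variant of the same descriptor.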

A given test GPR alarm has around 300 to 400 depth values. The buried object
signature is not expected to cover all the depth values. Thus, extracting one global feature
vector from the alarm may not discriminate between object and clutter signatures effectively.
To avoid this limitation, each potential target (identified by a prescreener) needs to be tested
at multiple depth values. Typically, a 30 x 15 x 7 window is slid along the depth axis
with a 50% overlap between 2 consecutive signatures. A total of 17 signatures are extracted
for each target. Thus, each alarm is represented by a bag of 17 instances. For each
instance, the EHD histograms (EHDDT and EHDCT) are extracted. Then, a possibilistic
k-Nearest Neighbors (k-NN) rule is used to assign a partial confidence value [120] to each
instance individually. We should note here that the bag representation is used only to group
features from multiple depths, and is not used in an MIL context.
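
The windowing step can be illustrated with a short Python sketch. It assumes the alarm is stored as an array whose first axis is depth; the function name `extract_bag` and the array layout are hypothetical. Which portion of the 300 to 400 depth values is actually scanned is not stated in the text; as a point of reference, a 270-sample depth range with a 30-sample window and 50% overlap yields exactly 17 windows.

```python
import numpy as np


def extract_bag(alarm, depth_window=30, overlap=0.5):
    """Slide a window along the depth axis (axis 0) and return the list of instances."""
    stride = max(1, int(round(depth_window * (1.0 - overlap))))
    starts = range(0, alarm.shape[0] - depth_window + 1, stride)
    return [alarm[s:s + depth_window] for s in starts]


# Example: a synthetic alarm of 270 depth samples, 15 down-track scans, 7 channels.
bag = extract_bag(np.zeros((270, 15, 7)))
```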

7.1.3  Fisher  Vector  discrimination  algorithms 

The Fisher Vector (FV) extracts features at multiple depths of the 3-D GPR signatures.
First, each 2-D GPR view (i.e., (down-track, depth) or (cross-track, depth)) is
divided into 60 overlapping windows along the depth axis. Next, each window is in turn
divided into a set of sub-patches using a grid partitioning. Then, 128-D dense SIFT [118]
features are extracted for each sub-patch; a sample window with extracted SIFT features
is shown in Figure 7.3. Finally, the FV is used to aggregate the extracted set of features
into a global feature vector for each window. In total, 60 FV features are extracted for
the (down-track, depth) view and 60 FV features are extracted for the (cross-track, depth)
view.


Figure  7.3:  Sample  GPR  alarm  with  dense  SIFT  features  (only  first  and  last  features  are 
shown) 

The FV patch aggregation mechanism is based on the Fisher Kernel. The Fisher Kernel
characterizes a sub-patch by its deviation from a generative model. The deviation is the
gradient of the sub-patch log-likelihood with respect to the generative model parameters. The
vectorial representation of all the deviations is called the Fisher Vector (FV). For instance,
using the extracted dense SIFT features (descriptors), a generative model, such as a Gaussian
Mixture Model (GMM) with $K$ words, is learned. It can be regarded as a "probabilistic
visual vocabulary". Let $X = (x_1, \ldots, x_N)$ be a set of $D$-dimensional feature vectors. Let
$\Theta = (\mu_k, \Sigma_k, \pi_k : k = 1, \ldots, K)$ be the parameters of a GMM fitting the
distribution of descriptors, where $\pi_k$, $\mu_k$, and $\Sigma_k$ are respectively the mixture
weight, mean, and covariance matrix of Gaussian $k$. The GMM associates each vector $x_i$ to a
mode $k$ in the mixture with a strength given by the posterior probability:

$$q_{ik} = \frac{\exp\left[-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right]}{\sum_{t=1}^{K} \exp\left[-\frac{1}{2}(x_i - \mu_t)^T \Sigma_k^{-1} (x_i - \mu_t)\right]}.$$

For each mode $k$, we compute the mean and covariance deviation vectors

$$u_{jk} = \frac{1}{N\sqrt{\pi_k}} \sum_{i=1}^{N} q_{ik}\, \frac{x_{ji} - \mu_{jk}}{\sigma_{jk}}, \qquad
v_{jk} = \frac{1}{N\sqrt{2\pi_k}} \sum_{i=1}^{N} q_{ik} \left[\left(\frac{x_{ji} - \mu_{jk}}{\sigma_{jk}}\right)^2 - 1\right], \tag{7.3}$$

where $j = 1, 2, \ldots, D$ spans the vector dimensions. The FV of a given GPR window is the
concatenation of the vectors $u_k$ and $v_k$ for each of the $K$ modes in the Gaussian mixture,
i.e.,

$$\Phi(I) = [u_1 \cdots u_k \cdots v_1 \cdots v_k \cdots]. \tag{7.4}$$


Due to the absence of ground truth at the window level, a simple heuristic was used to label
the data. It consists of assigning positive labels to windows with high energy, and negative
labels otherwise. Having labeled the windows, an SVM is then used to learn a classifier and assign
partial confidences to the 60 extracted windows. Thus, each alarm is represented by
a bag of 60 instances. Each instance is a 2-D vector composed of the FisherVectorDT confidence
value and the FisherVectorCT confidence value.
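
The Fisher Vector encoding of one window can be sketched as below. This is an illustrative sketch under stated assumptions rather than the implementation used in this work: the GMM is assumed to have diagonal covariances (so each Gaussian is described by per-dimension standard deviations), the posteriors are computed as standard GMM responsibilities that include the mixture weights, and the function name `fisher_vector` is hypothetical. Fitting the GMM itself is outside the scope of the sketch.

```python
import numpy as np


def fisher_vector(X, weights, means, sigmas):
    """X: (N, D) descriptors; weights: (K,); means, sigmas: (K, D) diagonal GMM parameters."""
    N = X.shape[0]
    z = (X[:, None, :] - means[None]) / sigmas[None]       # (N, K, D) standardized residuals
    # Log of (weight * Gaussian density) per component, up to a constant shared by all k.
    log_p = np.log(weights) - 0.5 * (z ** 2).sum(-1) - np.log(sigmas).sum(-1)
    log_p -= log_p.max(axis=1, keepdims=True)              # numerical stability
    q = np.exp(log_p)
    q /= q.sum(axis=1, keepdims=True)                      # (N, K) posteriors
    # Mean and covariance deviation vectors, one row per Gaussian mode.
    u = (q[:, :, None] * z).sum(0) / (N * np.sqrt(weights)[:, None])
    v = (q[:, :, None] * (z ** 2 - 1)).sum(0) / (N * np.sqrt(2 * weights)[:, None])
    return np.concatenate([u.ravel(), v.ravel()])          # length 2 * K * D
```

For 128-D SIFT descriptors and a K-word vocabulary, the resulting vector has 2 x K x 128 entries per window.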


7.1.4  Auxiliary  Feature  Extraction 

In some cases, the EHDDT and EHDCT algorithms can provide complementary evidence,
while in other cases they provide contradicting evidence. In the latter case, a fusion
system needs to trust one algorithm over the other. This can be achieved by learning an
appropriate linear combination of weights for the algorithms within each local context. For this
method to be effective, extracted local contexts need to: (i) have a consistent algorithm that
can be trusted and can lead to better discrimination, or (ii) have a trivial solution due
to context purity (a pure context includes mainly target signatures and few or no
non-target signatures, or vice versa). However, because of the low dimensionality of the
available inputs (only 2 dimensions, EHDDT and EHDCT), in some regions of the input
space it may be difficult to obtain contexts in which a combination of the algorithms will
improve the discrimination results. To improve the partition of the input space, we extract
auxiliary features synthesized from the shape of the radar signal at certain depths.

In the following, we outline the extraction of two auxiliary features: SignatureWidth
for Down-Track and SignatureWidth for Cross-Track. As the names indicate, and by
analogy to EHDDT and EHDCT, the two additional features consist of the effective width
of the strong components within the GPR signal along (depth, down-track) slices and the
width along (depth, cross-track) slices.

Let $B_{z(i),y}^{(x)}$ be the 2-dimensional signature corresponding to the measured radar
signal collected at a fixed cross-track position (referenced here by $x$) and encapsulating the
30 depth bins starting at $z(i)$. In other words, $B_{z(i),y}^{(x)}$ is one of the 17 signatures
(instances) of one alarm. Similarly, let $B_{z(i),x}^{(y)}$ be the 2-dimensional signature at a
fixed down-track position (referenced here by $y$). Figure 7.4 displays 3 signatures extracted
from target and non-target GPR alarms. As can be seen, target signatures can be characterized by a
right rising edge (45° diagonal), and a left decreasing edge (135° anti-diagonal). Typically,
wider structures (covering more than 11 scans) can indicate the presence of an object of
interest (due to known target sizes), and should lead to a higher probability of detection.
SignatureWidth auxiliary features are based on this observation.

To extract the SignatureWidth of a given instance, we use two of the edges computed
for the EHD: 45° diagonal, and 135° anti-diagonal. These diagonal and anti-diagonal
edge strengths are summed along the depth dimension. The resulting 1-D signals, called
hereafter DGStrength and ADStrength respectively, are normalized by the number of
instance depths (i). By thresholding the latter signals we can extract two key locations
that define the spread of the strongest component within the instance, and thus obtain
the SignatureWidth. These two key locations are respectively the point SD, where
DGStrength starts rising above a threshold value DGThresh, and the point SA, where
ADStrength starts decreasing below a threshold value ADThresh.

Figure 7.4: Target and Non-Target signatures.

Formally, SD is defined as

$$\mathit{SD} = \min\{i \mid \mathit{DGStrength}_i > \mathit{DGThresh}\}. \tag{7.5}$$

Similarly,

$$\mathit{SA} = \max\{i \mid \mathit{ADStrength}_i > \mathit{ADThresh}\}. \tag{7.6}$$

Figure 7.5: Illustration of the identification of the SA and SD points.

The SignatureWidth is then defined as

$$\mathit{SignatureWidth} = \begin{cases} \mathit{SA} - \mathit{SD} & \text{if } \mathit{SA} > \mathit{SD}, \\ 0 & \text{otherwise.} \end{cases} \tag{7.7}$$


The identification of the SD and SA points is illustrated in Figure 7.5. Examples
of SignatureWidth features are shown in Figure 7.6.

Let SignatureWidthDT be the width feature for Down-Track and SignatureWidthCT
be the width feature for Cross-Track. Thus, each alarm is represented by a bag of 17
instances extracted at multiple depths. Each instance includes 2 features: SignatureWidthDT
and SignatureWidthCT.
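
Equations (7.5) to (7.7) translate directly into a few lines of Python. The sketch below assumes the two strength signals have already been summed along depth and normalized; the function name `signature_width` and the choice to return 0 when either signal never exceeds its threshold are assumptions, since the text does not cover that case.

```python
import numpy as np


def signature_width(dg_strength, ad_strength, dg_thresh, ad_thresh):
    """Distance between the first strong diagonal and the last strong anti-diagonal response."""
    dg_hits = np.flatnonzero(np.asarray(dg_strength) > dg_thresh)
    ad_hits = np.flatnonzero(np.asarray(ad_strength) > ad_thresh)
    if dg_hits.size == 0 or ad_hits.size == 0:
        return 0                      # assumption: no threshold crossing means zero width
    sd, sa = dg_hits[0], ad_hits[-1]  # SD = min index (7.5), SA = max index (7.6)
    return int(sa - sd) if sa > sd else 0


# Example with synthetic strength profiles along the scan axis.
dg = [0.0, 0.0, 0.9, 0.8, 0.1, 0.0, 0.0, 0.0]
ad = [0.0, 0.0, 0.0, 0.0, 0.2, 0.7, 0.9, 0.1]
width = signature_width(dg, ad, 0.5, 0.5)
```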




(a) Down-Track, width = 13    (b) Cross-Track, width = 6

Figure 7.6: Examples of SignatureWidthDT (gprDT) and SignatureWidthCT (gprCT) features
for a target object.

7.1.5  Data  Collection 

GPR data collected at different locations and on different dates were used to evaluate
our algorithms. In particular, two collections were used to train and test the proposed
fusion methods. The first collection, Collection-1, was collected from two different sites
and covers a variety of anti-tank mines, including 319 encounters of anti-tank mines with high
metal content (ATHM) and 422 encounters of anti-tank mines with low metal content (ATLM). In
addition, a variety of clutter objects were surveyed in an effort to test the robustness of the
fusion algorithms. The targets were buried up to 8 inches deep. First, a prescreener [122]
is used to process the GPR data and identify regions of interest to be processed further
by the discrimination algorithms. The prescreener identified 700 target encounters and 330
non-targets (false alarms). Collection-1 is used in the following to perform 10-fold
cross-validation. The second collection, Collection-2, was collected from three different sites. The
first two sites cover 789 target encounters, of which 339 were of type ATHM and 450 of type ATLM;
1577 non-targets were also identified at the first two sites. The third site of Collection-2
covers 1948 targets (847 ATHM and 1097 ATLM) and 3018 non-targets. In the following,
Site 1 and Site 2 of Collection-2 will be used exclusively for training and Site 3 will be
used exclusively for testing.

In the next section, we describe the fusion system that we have developed based
on a traditional Mamdani inference system [123] and using 4 features: EHDDT, EHDCT,
SignatureWidthDT, and SignatureWidthCT.

7.2  Fusion of Multiple Landmine Detection Algorithms Using Traditional Fuzzy
Inference

Our goal is to design a system that accepts (as input) arbitrary sets of discrimination
confidence values and additional contextual knowledge (such as SignatureWidth), and
is able to: 1) derive a set of fuzzy rules from the available input knowledge; 2) learn the
associated output fuzzy sets; and 3) output a final confidence value representing the degree to
which a GPR alarm should be considered as a target. To fulfill this functional requirement,
we first design two traditional fuzzy inference systems, based on Mamdani inference and
ANFIS. Next, we develop fusion systems using our proposed multiple instance fuzzy inference
framework. In particular, we develop fusion methods based on our MI-Mamdani and
MI-ANFIS and compare their performance to that of the traditional fuzzy inference systems.

Given that traditional fuzzy systems cannot learn from ambiguously labeled data,
information about correct target depths needs to be provided (i.e., instances need to be
labeled). To do so, for each positive bag we assign a positive label to the instances with the
highest energy (energy can be computed by taking the sum of the absolute values of the GPR
signals within an instance); a human expert then validates the labeling. On the
other hand, our multiple instance framework does not require labels at the instance level
and can learn from ambiguously labeled data. In the following, we show that
even though our framework does not require instance labels, it provides better results
because it uses all available information in a given bag to perform fuzzy reasoning.

7.2.1  Fusion  of  Multiple  Landmine  Detection  Algorithms  Using  Mamdani  Fuzzy 
Inference 

To learn traditional Mamdani fuzzy rules, first, the input space is partitioned to identify
local contexts. Second, input membership functions are learned based on the statistics
of the input features (partial confidence values and auxiliary features) within each context.
Third, output membership functions are generated by
considering the distributions of targets and non-targets within each context. Finally, the
input and output membership functions are combined into a Mamdani-type fuzzy inference
system. The output of the learning process is a fuzzy rule base (FRB) adapted to different
contexts.

For this task, we have generated N = 3050 training observations from Collection-1,
with desired output $T = \{t_j \mid j = 1, \ldots, N\}$, that correspond to instances processed by
different discrimination algorithms and/or background feature extractors (EHDDT, EHDCT,
SignatureWidthDT, SignatureWidthCT). From each non-target alarm, we selected 5
instances at an equal sampling interval, and from each target alarm we selected the 2 instances
with the highest combined EHDDT and EHDCT confidence values.
An expert is used to label the data.

The partial confidence values of a given discriminator $d$ are denoted by
$Y_d = \{y_{dj} \mid j = 1, \ldots, N\}$. Each auxiliary feature $e$ is denoted by
$B_e = \{b_{ej} \mid j = 1, \ldots, N\}$. The $D$ ($D = 2$) discriminators and $E$ ($E = 2$)
background features are then concatenated to generate one global descriptor for each observation:

$$X = \Big(\bigcup_{d=1}^{D} Y_d\Big) \cup \Big(\bigcup_{e=1}^{E} B_e\Big) = \{x_j = [y_{1j}, \ldots, y_{Dj}, b_{1j}, \ldots, b_{Ej}] \mid j = 1, \ldots, N\}. \tag{7.8}$$

To simplify notation, we will use $x_{ij}$ to denote either $y_{ij}$ or $b_{ij}$, and rewrite (7.8) as:

$$X = \Big(\bigcup_{d=1}^{D} Y_d\Big) \cup \Big(\bigcup_{e=1}^{E} B_e\Big) = \{x_j = [x_{1j}, \ldots, x_{(K=D+E)j}] \mid j = 1, \ldots, N\}. \tag{7.9}$$

The proposed fusion system can be expressed by means of a fuzzy rule base composed
of a union of if-then fuzzy rules. A typical Mamdani-style fuzzy rule, $\mathcal{R}^i$, has the
following form:

$\mathcal{R}^i$: If $x_1$ is $M_1^i$ and $x_2$ is $M_2^i$, ..., and $x_K$ is $M_K^i$, then $o^i$ is $C^i$.  (7.10)

In (7.10), $\mathcal{R}^i$, $i = 1, 2, \ldots, r$, is the $i$th fuzzy rule, $M_j^i$ is a fuzzy set
associated with the $j$th input, and $C^i$ is the fuzzy set describing the output of the $i$th rule.
The FRB is the union of all rules:

$$\mathrm{FRB} = \bigcup_{i=1}^{r} \mathcal{R}^i. \tag{7.11}$$

The fuzzy sets in (7.10) consist of linguistic labels characterized by parameterized
membership functions $M_k^i$ and $C^i$. We use trapezoidal membership functions that can be
completely determined by four scalar parameters $l$, $m$, $h$, and $u$, where $l$ and $u$ locate the
"feet" (support) of the trapezoid and the parameters $m$ and $h$ locate the "shoulders" (core).
Formally,

$$M_k^i(x_k) = \max\left(\min\left(\frac{x_k - l_k^i}{m_k^i - l_k^i},\ 1,\ \frac{u_k^i - x_k}{u_k^i - h_k^i}\right),\ 0\right). \tag{7.12}$$

For the rules' outputs (i.e., $C^i$), we use Gaussian membership functions:

$$C^i(v) = \exp\left(-\frac{(v - c_o^i)^2}{2(\sigma_o^i)^2}\right), \tag{7.13}$$

where $c_o^i$ and $(\sigma_o^i)^2$ are the mean and variance of the Gaussian function.

Identifying the FRB in (7.11) is equivalent to identifying its underlying parameters:

1.  Premise parameters: $\mathcal{P} = \{l_k^i, m_k^i, h_k^i, u_k^i \mid i = 1, 2, \ldots, r;\ k = 1, 2, \ldots, K\}$; and

2.  Consequent parameters: $\mathcal{C} = \{c_o^i, \sigma_o^i \mid i = 1, 2, \ldots, r\}$.

To identify the premise parameters, we first cluster the N training observations along
each dimension $k = 1, \ldots, K$ into $r_k$ clusters using the K-means algorithm [124]. K-means
returns a list of cluster centers (C) and the set of points associated with each cluster. Then,
the premise parameters are derived from the clusters' centers and widths $(c_k^i, \sigma_k^i)$ along
each input dimension by transforming the Gaussian membership function defined by the cluster's
center and width $(c_k^i, \sigma_k^i)$ into a trapezoidal one using:

$$\begin{cases}
l_k^i = c_k^i - \alpha \times \sigma_k^i \\
m_k^i = c_k^i - \beta \times \sigma_k^i \\
h_k^i = c_k^i + \beta \times \sigma_k^i \\
u_k^i = c_k^i + \alpha \times \sigma_k^i
\end{cases} \quad \text{such that } \alpha > \beta. \tag{7.14}$$

Trapezoidal membership functions have larger cores and are more suitable for the
fuzzification of discriminators' confidence values. In (7.14), the parameters $\beta$ and $\alpha$
control the width of the core and of the support of the trapezoidal functions, respectively.
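
As a concrete illustration of (7.12) and (7.14), the following Python sketch converts a cluster's center and width into trapezoid parameters and evaluates the resulting membership function. The default values of alpha and beta are placeholders chosen for the example only (the text does not report the values used), the function names are hypothetical, and a strictly positive cluster width is assumed.

```python
import numpy as np


def trapezoid_params(center, width, alpha=2.0, beta=1.0):
    """Map a cluster's (center, width) to the trapezoid parameters (l, m, h, u)."""
    if not alpha > beta:
        raise ValueError("expected alpha > beta")
    return (center - alpha * width, center - beta * width,
            center + beta * width, center + alpha * width)


def trapezoid_mf(x, l, m, h, u):
    """Trapezoidal membership: 0 outside [l, u], 1 on [m, h], linear in between."""
    x = np.asarray(x, dtype=float)
    rising = (x - l) / (m - l)
    falling = (u - x) / (u - h)
    return np.maximum(np.minimum(np.minimum(rising, 1.0), falling), 0.0)


# Example: a cluster centered at 5 with width 1 gives the trapezoid (3, 4, 6, 7).
params = trapezoid_params(5.0, 1.0)
degrees = trapezoid_mf([2, 3, 3.5, 4, 5, 6, 6.5, 7, 8], *params)
```

A larger beta widens the core (the region with full membership), while a larger alpha widens the support.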

To learn the consequent parameters, we count the number of target and non-target
instances within each region of the input space (i.e., the clusters generated by K-means in
the previous step). Then, the proportion of target instances is used as the mean of the
output membership function (i.e., $c_o^i$) and the width is fixed. Using this assignment, regions
dominated by target instances will have an output closer to 1, while regions dominated by
non-targets will have an output closer to 0.
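
The counting step can be sketched as follows. The sketch assumes each training instance has already been assigned an integer cluster index and a binary target label; the function name `consequent_centers` is hypothetical, and the minimum cluster size of 10 mirrors the threshold used later when rules are generated.

```python
import numpy as np


def consequent_centers(cluster_ids, is_target, n_clusters, min_samples=10):
    """Return {cluster: proportion of target instances} for sufficiently populated clusters."""
    cluster_ids = np.asarray(cluster_ids)
    is_target = np.asarray(is_target, dtype=float)
    totals = np.bincount(cluster_ids, minlength=n_clusters)
    targets = np.bincount(cluster_ids, weights=is_target, minlength=n_clusters)
    return {c: targets[c] / totals[c]
            for c in range(n_clusters) if totals[c] >= min_samples}


# Example: cluster 0 is target-dominated, cluster 1 has no targets, cluster 2 is too small.
ids = [0] * 12 + [1] * 10 + [2] * 3
labels = [1] * 9 + [0] * 3 + [0] * 10 + [1] * 3
centers = consequent_centers(ids, labels, n_clusters=3)
```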

Once the FRB is identified, we use the inference process described in Section 2.3.1.
For each new depth instance, we start by fuzzifying the input. The role of fuzzification is
to determine the membership degree of each input dimension in the rules' input fuzzy sets.
After this step, the implication is executed using the product as the conjunction (and) operator,
which leads to rules being activated to different degrees. Then, the rules' outputs are
aggregated, and defuzzification is executed to produce a crisp confidence value indicating the
degree to which the instance should be considered as a target. To test a GPR alarm, each
depth instance is fed to the system and its partial confidence value is computed. Then a
final confidence value is assigned to the alarm by taking the average of the 3 instances
with the largest confidence values [117].
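
The final aggregation step, averaging the 3 largest instance confidences of an alarm, is simple enough to state in code. The function name `alarm_confidence` is hypothetical; the per-instance confidences are assumed to come from the fuzzy inference described above.

```python
import numpy as np


def alarm_confidence(instance_confidences, top_k=3):
    """Average the top_k largest per-instance confidence values of one alarm."""
    values = np.sort(np.asarray(instance_confidences, dtype=float))[::-1]
    return float(values[:top_k].mean())


# Example: the three largest values are 0.9, 0.8 and 0.7.
conf = alarm_confidence([0.1, 0.9, 0.4, 0.8, 0.7])
```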

7.2.1.1  Rule Generation

First, the confidence values of the EHDDT and EHDCT discriminators as well as the
SignatureWidthDT and SignatureWidthCT background features are extracted. To partition
the input space using the K-means algorithm, the EHDDT and EHDCT inputs were each divided
into 3 fuzzy sets: Low, Medium, and High, whereas SignatureWidthDT
and SignatureWidthCT were each quantized into Narrow, Medium, and Wide fuzzy sets. For
the Gaussian output membership function, we set $\sigma$ to 0.05. This partitioning generates a
total of 81 clusters ($3^4$, one per combination of fuzzy sets across the 4 inputs). We discard
clusters that have few samples (< 10). This results in 21 rules.

Figure 7.7: Illustration of the generated Mamdani Fuzzy Rule Base (FRB), showing 4 of
the 21 rules.

The rules obtained are intuitive and easily interpretable. For instance, in Figure
7.7, we display 4 of the 21 rules, when the input = [EHDDT = 3.75, EHDCT =
2.61, SignatureWidthDT = 9.73, SignatureWidthCT = 10.3]. Rules 1 and 2 state the
following:

$\mathcal{R}^1$: If EHDDT is High and EHDCT is Low and SignatureWidthDT is Medium
and SignatureWidthCT is Wide then $o^1$ is High.  (7.15)

$\mathcal{R}^2$: If EHDDT is Low and EHDCT is Low and SignatureWidthDT is Narrow
and SignatureWidthCT is Narrow then $o^2$ is Low.  (7.16)

Rule 3 is identical to Rule 2 except that SignatureWidthDT and SignatureWidthCT are
now both High. As a result, the output increases from Low to Medium.

$\mathcal{R}^3$: If EHDDT is Low and EHDCT is Low and SignatureWidthDT is High
and SignatureWidthCT is High then $o^3$ is Medium.  (7.17)

7.2.2  Fusion  of  Multiple  Landmine  Detection  Algorithms  Using  ANFIS 

In the following, we outline a fusion method based on Adaptive Neuro-Fuzzy Inference
Systems (ANFIS) [125] capable of simultaneously identifying local contexts as well as
learning optimal weights for combining local expert discriminators.

Given the same training data used to train the previous fusion method (i.e., X and
T), we use ANFIS to iteratively achieve: 1) structure identification, which relates to
determining the number of fuzzy if-then rules and an optimal partition of the input space, and
2) parameter identification, which involves learning of the optimal partitions (contexts) and
combination weights. To learn the rules, first, the input space is partitioned to identify
local contexts. Second, input membership functions are learned based on the statistics of the
partial confidence values of the individual discriminators as well as additional background
information within each context. Third, the output parameters of the rules are initialized
using a least squares estimator (LSE). Finally, the input and output membership functions
are combined into a Sugeno-type fuzzy inference system. The resulting ANFIS system is
then trained using a hybrid learning algorithm [125]. The output of the learning process is
a fuzzy rule base adapted for different contexts.

As detailed in Section 2.3.3, ANFIS can be expressed by means of a fuzzy rule base
(FRB) composed of a union of Sugeno-type if-then fuzzy rules. A typical Sugeno fuzzy rule
has the following form:

$\mathcal{R}^i$: If $x_1$ is $M_1^i$ and $x_2$ is $M_2^i$, ..., and $x_K$ is $M_K^i$, then
$o^i = a_1^i \times x_1 + a_2^i \times x_2 + \ldots + a_K^i \times x_K + b^i$,  (7.18)

where $\mathcal{R}^i$, $i = 1, 2, \ldots, r$, denotes the $i$th fuzzy rule, $M_j^i$ is a fuzzy set
associated with the $j$th fusion input, $a_j^i$ is a weight assigned to the $j$th discriminator or
background feature, and $b^i$ is a constant. As before, the FRB is then obtained by taking the
union of all rules:

$$\mathrm{FRB} = \bigcup_{i=1}^{r} \mathcal{R}^i. \tag{7.19}$$

In this fusion system, we use Gaussian membership functions that can be completely
determined by two scalar parameters $c$ and $\sigma$, the center and width of the Gaussian function:

$$M_j^i(x_j) = \exp\left(-\frac{(x_j - c_j^i)^2}{2 \times (\sigma_j^i)^2}\right). \tag{7.20}$$

The ANFIS parameters are then:

1.  Premise parameters:
    $\mathcal{P} = \{c_j^i, \sigma_j^i \mid i = 1, 2, \ldots, r;\ j = 1, 2, \ldots, K\}$; and

2.  Consequent parameters:
    $\mathcal{C} = \{a_j^i, b^i \mid i = 1, 2, \ldots, r;\ j = 1, 2, \ldots, K\}$.

To identify the premise parameters (i.e., the parameters of the membership functions
$M_j^i$), we cluster the N training observations into $r$ clusters using the FCM algorithm [60].
FCM returns a list of cluster centers (C) and a partition matrix (U). The premise
parameters are then derived from the clusters' centers and widths using (2.57). To initialize the
rules' output parameters we use an ordinary least squares estimator as defined in (2.58).

Once the structure of the network is defined and initialized, we continue with the
learning process, which yields a network with fine-tuned membership functions and thus
fine-tuned contexts. Each rule can be viewed as a context with its associated optimal
combination weights (consequent parameters). When testing, an instance will activate
certain rules (contexts) to certain degrees and the network output will be the weighted
average of all rules' outputs.

As before, to test a GPR alarm, each depth instance is fed through the ANFIS network and
its partial confidence value is computed as the defuzzification of all rules' outputs. Then a
final confidence value is assigned to the alarm by taking the average of the 3 instances
with the largest confidence values.
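
The forward pass of such a Sugeno-type rule base, using the Gaussian membership functions of (7.20) and the linear consequents of (7.18), can be sketched as follows. The use of the product to combine memberships within a rule is an assumption (it is the usual ANFIS choice and matches the operator used for the Mamdani system above), and the function name `sugeno_output` is hypothetical. Training (FCM initialization, least squares, hybrid learning) is not shown.

```python
import numpy as np


def sugeno_output(x, centers, sigmas, coeffs, biases):
    """x: (K,); centers, sigmas, coeffs: (r, K); biases: (r,). Returns the crisp output."""
    x = np.asarray(x, dtype=float)
    memberships = np.exp(-((x - centers) ** 2) / (2.0 * sigmas ** 2))   # (r, K)
    firing = memberships.prod(axis=1)            # rule activation, product as "and"
    rule_outputs = coeffs @ x + biases           # linear consequent of each rule
    # Weighted average of the rule outputs; assumes at least one rule fires.
    return float((firing * rule_outputs).sum() / firing.sum())


# Example: two rules with constant outputs 0 and 1, equally activated at x = (1, 1).
centers = np.array([[0.0, 0.0], [2.0, 2.0]])
sigmas = np.ones((2, 2))
out = sugeno_output([1.0, 1.0], centers, sigmas, np.zeros((2, 2)), np.array([0.0, 1.0]))
```

An instance far from every rule center drives all firing strengths toward zero, so a practical implementation would guard the division.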

7.2.2.1  Rule Generation

First, the confidence values of the EHDDT and EHDCT discriminators as well as the
SignatureWidthDT and SignatureWidthCT auxiliary features are extracted. Then, the
input space is partitioned using the FCM algorithm into 16 clusters. Next, the ANFIS
parameters are identified as described above. Finally, the rules' parameters are fine-tuned using the
hybrid learning algorithm outlined in Section 2.3.3. The output of this process is a rule
base optimized for the fusion of multiple landmine detection algorithms. Figure 7.8 is an
illustration of 2 of the 16 learned rules, when the input = [EHDDT = 3.75, EHDCT =
2.61, SignatureWidthDT = 9.73, SignatureWidthCT = 10.3]. Rules 1 and 2 state the
following:

$\mathcal{R}^1$: If EHDDT is Low and EHDCT is Low and SignatureWidthDT is Medium
and SignatureWidthCT is Narrow then

$o^1 = 15.3 \times \mathrm{EHDDT} + 0.1358 \times \mathrm{EHDCT} - 7.681 \times \mathrm{SignatureWidthDT} + 3.921$  (7.21)

$\quad \times\ \mathrm{SignatureWidthCT} - 116.8$  (7.22)

$\mathcal{R}^2$: If EHDDT is Medium and EHDCT is Medium and SignatureWidthDT is Wide
and SignatureWidthCT is Wide then

$o^2 = -1.594 \times \mathrm{EHDDT} - 0.05 \times \mathrm{EHDCT} + 0.0058 \times \mathrm{SignatureWidthDT} - 1.686$  (7.23)

$\quad \times\ \mathrm{SignatureWidthCT} + 45.8$  (7.24)

[Figure 7.8 panels: EHDDT, EHDCT, widthDT, widthCT]

Figure 7.8: Illustration of the generated ANFIS Fuzzy Rule Base (FRB), showing 2 of the
16 rules.


ANFIS rules (Sugeno rules in general) are not as interpretable as Mamdani rules.
However, they are more optimized for the desired fusion application and yield better results,
as shown in the next section.

7.2.3  Results 

Figure 7.9 displays a scatter plot of EHDDT vs. EHDCT. As can be seen, the
two detectors are consistent for most targets and false alarms (FA). However, there are
several cases where the confidence values are not consistent. For instance, region R2 includes
samples where the EHDDT discriminator performed better than the EHDCT discriminator.
Similarly, EHDCT can help identify targets (e.g., within R1) that may be missed
by EHDDT. In some cases where both discriminators agree on low confidence values,
SignatureWidth auxiliary features can help increase the final confidence value, as shown in
Mamdani Rule 3 in Figure 7.7.

We compare the performance of the proposed fusion methods (i.e., Mamdani and ANFIS)
to the individual discriminators and two global fusion methods: (i) the first global fusion
method computes the geometric mean of EHDDT and EHDCT; (ii) the second global
fusion method computes the geometric mean of EHDDT, EHDCT, SignatureWidthDT, and
SignatureWidthCT.
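
For reference, the two global baselines amount to a geometric mean across input columns, as in the sketch below. It assumes strictly positive inputs arranged with one row per instance; whether the authors applied the mean per instance or per alarm is not stated, and the function name `geometric_mean_fusion` is hypothetical.

```python
import numpy as np


def geometric_mean_fusion(confidences):
    """confidences: (n_instances, n_inputs) positive values; returns one value per row."""
    confidences = np.asarray(confidences, dtype=float)
    # Computed in the log domain: exp(mean(log(x))) equals the n-th root of the product.
    return np.exp(np.log(confidences).mean(axis=1))


# Example: fusing two inputs for two instances.
fused = geometric_mean_fusion([[4.0, 9.0], [1.0, 1.0]])
```

Unlike the rule-based systems, this baseline applies the same combination everywhere in the input space, which is why it is referred to as a global method.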


Figure 7.9: Comparison of the performances of EHDDT and EHDCT discriminators.

The individual discriminators and the proposed fusion methods were trained and tested using
10-fold cross-validation. Figure 7.10 displays the ROCs of all methods. As can be seen,
the proposed Mamdani fuzzy fusion method outperformed the two global fusion methods
and all of the individual discriminators. This is due mainly to the localized approach used
by our system to better define local contexts by means of fuzzy rules, resulting in a better
combination of the inputs. However, ANFIS gave the best overall performance. In addition
to being a localized approach, ANFIS jointly identifies local contexts and learns optimal
weights for combining local discriminators.

[Figure 7.10 horizontal axis: FAR (scoring area: 12383.3 m^2), x 10^-3]

Figure 7.10: Comparison of the individual discriminators and the proposed fuzzy fusion
method.

7.3  Fusion of Multiple Landmine Detection Algorithms Using Multiple Instance
Fuzzy Inference

Discrimination algorithms detect target candidates only in two dimensions (down-track
and cross-track position). Thus, there is uncertainty in the depth estimation of the
targets that can affect both the training and testing phases of a fusion system. For training,
it is very difficult to localize the object's depth automatically, and it is a very tedious process
to do so manually. Similarly, during testing, it is not trivial to combine partial confidence
values from the multiple windows.

The fusion training data are already grouped into bags. Each bag represents a GPR
alarm and has instances extracted at multiple depths. Labels for the bags are available as
binary ground truth: target/non-target (positive/negative). This formulation fits the MIL
paradigm perfectly.

TABLE 7.1
MI-Mamdani Parameters

Parameter                      Value
Number of MI Rules             5
Number of Inputs               4
Membership functions           trapezoidal MFs
MFs’ parameters                learned using FCMI
Number of Training bags        1030: 700 positive bags and 330 negative bags
Output parameters              singleton fuzzy set {1}
Truth instances aggregation    average of top 3

In the following, we develop two multiple instance fuzzy inference systems for the
fusion of discriminators and auxiliary features. The first system is based on the proposed
MI-Mamdani inference, and the second system is based on the proposed MI-ANFIS.
In addition, we conduct two experiments. In the first experiment, as in the previous section,
we use EHDDT, EHDCT, SignatureWidthDT, and SignatureWidthCT to design a fusion
system using MI-FISs. In the second experiment, we fuse the outputs of the EHDDT,
EHDCT, FisherVectorDT, and FisherVectorCT discriminators.


7.3.1  Fusion  of  Multiple  Landmine  Detection  Algorithms  Using  MI-Mamdani 

To  learn  an  MI-Mamdani  system  from  the  training  data  (bags)  for  the  purpose  of 
fusion  of  discrimination  algorithms,  first,  we  apply  the  FCMI  to  extract  concept  points. 
Next,  we  generate  multiple  instance  fuzzy  rules  from  concept  points  as  outlined  in  Section 
4.3.  Finally,  the  learned  rules  are  combined  into  an  MI-Mamdani  multiple  instance  fuzzy 
inference  system.  We  note  that  to  aggregate  the  truth  instances  at  the  rules’  level  we  used 
an  Ordered  Weighted  Averaging  Operator  (OWA)  that  outputs  the  average  of  the  top  three 
highest  truth  instances. 
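The top-three OWA aggregation mentioned above can be sketched as follows. This is a minimal illustration rather than the dissertation's implementation; the function names are hypothetical, and the weight vector simply places 1/3 on each of the three largest truth instances.

```python
import numpy as np

def owa(values, weights):
    """Ordered weighted average: weights apply to the values sorted in descending order."""
    v = np.sort(np.asarray(values, dtype=float))[::-1]
    w = np.asarray(weights, dtype=float)
    return float(np.dot(v, w))

def top_k_average(truth_instances, k=3):
    """OWA whose weights average the k largest truth instances of a rule."""
    n = len(truth_instances)
    k = min(k, n)              # guard for bags with fewer than k instances
    w = np.zeros(n)
    w[:k] = 1.0 / k
    return owa(truth_instances, w)
```

For example, for the truth instances 0.1, 0.9, 0.5, 0.7, and 0.2, the aggregated value is the mean of 0.9, 0.7, and 0.5.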

After running FCMI, 5 concept points are identified and used to identify the parameters of
5 multiple instance fuzzy rules. The resulting rule base is illustrated in Figure 7.11. Table
7.1 summarizes the parameters used to identify the fusion rules.

The rules of our MI-Mamdani system describe concepts inferred from FCMI concept
points. If a given target has an instance that can be described by any of the concepts, it
will lead to a high defuzzified output, and eventually to a positive detection. However,
non-targets should not have any instance within positive concepts, and they will be assigned
a low output.

Figure 7.11: MI-Mamdani multiple instance fuzzy rules.


7.3.2  Fusion  of  Multiple  Landmine  Detection  Algorithms  Using  MI-ANFIS 

For this experiment, we construct a zero-order MI-ANFIS (constant consequent
parameters) having 5 multiple instance rules and employing Gaussian MFs to describe the
input fuzzy sets. To initialize the system’s parameters, we first use the FCM algorithm to
cluster the instances that belong to positive bags into 5 clusters, and we initialize the MFs’
centers as the clusters’ centers. Then, we set the standard deviations of the input MFs to
a preset value of 1. Finally, we set the output parameters to 1. Table 7.2 summarizes all
parameters used in training the MI-ANFIS.

TABLE 7.2
MI-ANFIS Training Parameters

Parameter                              Value
Number of MI Rules                     5
Number of Inputs                       4
Membership functions                   Gaussian MFs
MFs’ centers                           initialized using FCM
MFs’ standard deviations               preset to 1 (at epoch number 0)
Output parameters                      constants = 1 ({b_0^i = 1}_{i=1}^{5}, at epoch number 0)
Number of Training bags                1030: 700 positive bags and 330 negative bags
Number of Training Epochs              150
Parameter α used in softmax function   2
Learning rate                          0.1

After initialization, we run the MI-ANFIS basic learning algorithm (Algorithm 5.2) to jointly
learn a fuzzy description of the positive concepts as well as the optimal rules’ outputs.
Figure 7.12 is a graphical representation of the 5 multiple instance rules prior to running the
optimization process (dotted curves) and the learned rules after training (continuous
curves). Figure 7.13 plots the root mean squared error (RMSE) vs. the training epoch
number. The fuzzy sets of the rules’ antecedents describe the location and the extent of
the positive concepts in the 4-D instance feature space. The rules’ consequent values can
be interpreted as an assessment of the “positivity” of each learned concept. For instance,
the MI-ANFIS learned the following two positive concepts to describe targets:


R1: If EHDDT is Medium and EHDCT is Medium and SignatureWidthDT is High
and SignatureWidthCT is High then o1 = 1.15.  (7.25)

R2: If EHDDT is Medium and EHDCT is Low and SignatureWidthDT is High
and SignatureWidthCT is High then o2 = 0.94.  (7.26)


7.3.3  Results 

The proposed fusion methods were trained and tested using 10-fold cross validation
on Collection-1. Figure 7.10 displays the ROCs (averaged over the 10 folds) of all methods.
To provide a quantitative evaluation of the proposed multiple instance fuzzy inference fusion
methods, we compare their performance to the previously presented fusion methods
(Mamdani, ANFIS, and the two global geometric mean methods). We also compare the
performance of MI-Mamdani and MI-ANFIS to a naive MIL implementation of Mamdani
(NaiveMamdani) and ANFIS (NaiveANFIS), where all instances from positive bags are
considered positive and all instances from negative bags are considered negative.
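The naive labeling scheme used by NaiveMamdani and NaiveANFIS can be sketched as follows. This is a minimal illustration, assuming each bag is stored as an array of shape (instances, features); the function name and data layout are hypothetical.

```python
import numpy as np

def naive_instance_labels(bags, bag_labels):
    """Flatten the bags and copy each bag's label to every one of its instances."""
    X = np.vstack(bags)  # all instances stacked row-wise
    y = np.concatenate(
        [np.full(len(bag), label) for bag, label in zip(bags, bag_labels)]
    )
    return X, y
```

The resulting instance-level set can be fed to a standard (single instance) learner, at the cost of mislabeling the non-target instances that positive bags usually contain.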

Figure 7.14 displays the ROCs of all methods. Figure 7.15 shows the ROCs of the
MI-Mamdani, Mamdani, and NaiveMamdani fusion methods and the individual discriminators.
Figure 7.16 displays the ROCs of the MI-ANFIS, ANFIS, and NaiveANFIS fusion methods as
well as the individual discriminators. As can be seen in Figure 7.14, MI-ANFIS performed
better than the standard ANFIS on the large dataset, and, as expected, NaiveANFIS performed
worse. MI-Mamdani outperformed the individual discriminators (EHDDT and EHDCT)
and the NaiveMamdani fusion method. The standard Mamdani and ANFIS performed
better at low FAR (False Alarm Rate). The reason is that strong mines are
easy to identify manually, and in this case the ground truth helps. However, weaker mine
signatures are not as easy to localize, so the truth may not be as accurate and can degrade
the performance. Overall, MI-ANFIS outperformed all presented fusion approaches and the
individual discriminators (EHDDT and EHDCT). This is due to the ability of MI-ANFIS
to overcome labeling ambiguity by learning meaningful concepts.

Figure 7.12: MI-ANFIS fusion rules before and after training (dotted lines indicate the
initial MFs).

In the second experiment, we used the same settings as before to train the two best
performing algorithms, ANFIS and MI-ANFIS, to fuse the outputs of all discriminators,
i.e., EHDDT, EHDCT, FisherVectorDT, and FisherVectorCT. Fisher Vector based methods
extract 60 instances per GPR alarm (bag), whereas EHD based methods extract 17
instances. Thus, Fisher Vector bags contain 60 instances and EHD bags contain 17 instances,
all corresponding to the same GPR alarm. To be able to use the data within our multiple
instance fusion system, we expanded the EHD instances from 17 to 60 (by taking averages of
features that are extracted at different depths but correspond to the same window used to
generate the Fisher Vector instances). The resulting bag has 60 4-D instances. Since the
standard ANFIS cannot learn from partially labeled data, an expert labeled all instances of
all bags within the training data. We trained and tested all methods using 10-fold cross
validation on the same data collection as before.

Figure 7.13: A plot of MI-ANFIS RMSE during 150 training epochs.

Figure 7.17 illustrates the resulting ROCs. As can be seen, MI-ANFIS significantly
outperformed all discriminators and the standard ANFIS. The performance boost
over the individual discriminators is due to the substantial difference between the EHD and
the Fisher Vector features: EHDDT and EHDCT features are derived from the standard
MPEG-7 Edge Histogram Descriptors, whereas Fisher Vector is fundamentally based on
aggregating SIFT features. Besides, EHD and Fisher Vector methods use different classifiers
to assign confidence values to instances: EHD methods use a possibilistic KNN rule
and Fisher Vector methods use SVM. This diversity increases the amount of information
available to the fusion algorithms and consequently increases positive detections while
lowering false alarm rates. On the other hand, the degraded performance of the standard
ANFIS is linked to the degraded quality of the labeling of instances. The number of instances
per bag has more than tripled compared to the previous experiment (60 vs. 17), making the
assignment of correct labels by an expert an increasingly inaccurate process. Hence the
lower performance.

Figure 7.14: Comparison of the individual discriminators and all proposed fuzzy fusion
methods.

Figure 7.15: Comparison of the individual discriminators and the proposed MI-Mamdani,
Mamdani, and NaiveMamdani fuzzy fusion methods.

Figure 7.16: Comparison of the individual discriminators and the proposed MI-ANFIS, ANFIS,
and NaiveANFIS fuzzy fusion methods.

Thus far, we have used cross validation to report on the performance of the proposed
algorithms. Typically, cross validation is an adequate technique to predict the performance
of a given model on unseen examples. However, for applications such as landmine detection,
it is also important to report the results of blind testing to assess how well a model
performs in real-world situations, outside of lab settings. In the following, we use Collection-2
to train and test our fusion methods. The collection is very large and was collected at three
different sites. The main statistics are summarized in Table 7.3. Collection-1 was used to
train ANFIS and MI-ANFIS to fuse the outputs of all discriminators, i.e., EHDDT, EHDCT,
FisherVectorDT, and FisherVectorCT. Collection-2 was exclusively used for testing. Figure
7.18 shows the blind test ROCs.

MI-ANFIS showed consistent performance in the blind test. It outperformed the
individual discriminators and the standard ANFIS fusion. Although an expert
was used to label the training instances for ANFIS, the system could not overcome
the ambiguity associated with locating the target depths correctly on the testing data.

Figure 7.17: Comparison of all individual discriminators, ANFIS, and the proposed
MI-ANFIS fuzzy fusion methods.


TABLE 7.3
Data Collections

                                   Collection-2
                                   Site 1 & Site 2    Site-3
Phase                              Training           Testing
Total alarms                       2366               4967
Mine encounters                    789                1948
False alarms                       1577               3018
Total number of instances          141,960            297,960
Number of mine instances           47,340             116,880
Number of false alarm instances    94,620             181,080
Figure 7.18: Comparison of the performances of all individual discriminators, ANFIS, and
MI-ANFIS fuzzy fusion methods on the larger collection.
CHAPTER 8


CONCLUSIONS  AND  POTENTIAL  FUTURE  WORK 

8.1  Conclusions 

In this dissertation, we have introduced a new framework to accomplish fuzzy
inference with multiple instance data. In multiple instance problems, the training data are
ambiguously labeled. Instances are grouped into bags; the labels of bags are known, but not
those of individual instances. Our work generalizes traditional fuzzy logic and fuzzy
systems to enable reasoning with bags rather than single instances. The following sections
summarize our contributions.

8.1.1  Multiple  Instance  Fuzzy  Logic  (MI-FL) 

First, we have presented our generalization of fuzzy logic to enable fuzzy reasoning
with bags of instances instead of a single instance at a time. In particular, we have
introduced multiple instance variations of fuzzy propositions, fuzzy implication, fuzzy if-then
rules, and fuzzy reasoning. These building blocks are then used to derive more complex
fuzzy inference systems. Our formalization was derived using a thorough and abstract
mathematical formulation.

Fuzzy logic is powerful at modeling knowledge uncertainty and measurement imprecision.
More generally, it is one of the best frameworks to model vagueness. However, in addition
to uncertainty and imprecision, there is a third vagueness concept that standard fuzzy logic
does not address quite well. This vagueness concept is due to the ambiguity that arises
when data have multiple forms of expression, as is the case for multiple instance problems.
Our framework deals with ambiguity by introducing the novel concept of truth instances:
when carrying out reasoning using multiple instance fuzzy logic, a proposition will not have
only one degree of truth; it will have multiple degrees of truth, which we call truth instances.
This effectively encodes the third vagueness component of ambiguity and increases the
expressive power of traditional fuzzy logic.
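The notion of truth instances can be illustrated with a small sketch. It is only an assumption-laden example: it uses Gaussian membership functions and a product conjunction, and the function names and array shapes are hypothetical rather than taken from the formal definitions given earlier in the dissertation.

```python
import numpy as np

def gaussian_mf(x, center, sigma):
    """Gaussian membership degree of x in a fuzzy set (center, sigma)."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

def truth_instances(bag, centers, sigmas):
    """Return one degree of truth per instance of the bag for a single rule.

    bag: array of shape (M, D), M instances with D features each.
    centers, sigmas: length-D parameters of the rule's antecedent fuzzy sets.
    """
    memberships = gaussian_mf(np.asarray(bag, dtype=float), centers, sigmas)  # (M, D)
    # Conjunction over the D features; the product is an assumption made here.
    return memberships.prod(axis=1)
```

A bag with M instances therefore yields M degrees of truth for the same proposition, instead of the single degree obtained with a standard fuzzy rule.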

8.1.2  Multiple  Instance  Fuzzy  Inference  Systems  (MI-FIS) 

The traditional Mamdani and Sugeno inference systems outlined in Chapter 2 are
limited to reasoning with individual instances. First, the input of each system is an individual
instance. Second, the rules describe fuzzy regions within the instances’ space. Third, the
outputs of the systems correspond to the fuzzy inference using the D dimensions of a single
instance. Fourth, labels of the individual instances are required to learn the parameters of
the systems. In this dissertation, we have used our multiple instance fuzzy logic framework
to derive multiple instance Mamdani and Sugeno fuzzy inference styles capable of handling
MIL problems effectively. In addition, we have presented a method to learn multiple instance
rules from multiple instance data. First, we use the FCMI algorithm to extract target
concept points in the instances’ space. Target concepts are defined as regions in the instance
space that maximize the density of instances from positive bags and minimize the density
of instances from negative bags. Next, the target concepts are transformed into multiple
instance fuzzy rules. This approach is essentially based on intuition. Although the premise
and consequent parameters of the MI-FISs are learned from training data, the processes of
identifying both sets of parameters are independent.

8.1.3  Multiple  Instance  Adaptive  Neuro-Fuzzy  Inference  System  (MI-ANFIS) 

Another  major  contribution  of  this  dissertation  is  the  MI-ANFIS,  a  novel  neuro-fuzzy 
architecture  that  extends  the  standard  Adaptive  Neuro-Fuzzy  Inference  System  (ANFIS) 
to  reason  with  bags  of  instances.  We  first  argued  that  the  standard  ANFIS  can  be  used 
in  the  context  of  MIL  only  if  bags  are  labeled  at  the  instances  level.  Unfortunately,  this 
process  is  tedious,  ambiguous,  subjective,  and  prone  to  errors. 

The proposed generalization, MI-ANFIS, deals with ambiguity by using our proposed
concept of truth instances. Specifically, when carrying out reasoning using a bag of instances at
Layer 2 (Figure 5.1), a proposition will not have only one degree of truth; it will have
multiple degrees of truth (r^). This effectively encodes the third vagueness component of
ambiguity and increases the expressive power of the standard ANFIS. We have also
developed a backpropagation learning algorithm and showed that the proposed system is capable
of learning meaningful concepts from ambiguously labeled data. Unlike MI-FIS, MI-ANFIS
does not rely on any traditional MIL clustering algorithms and can simultaneously learn its
rule base from data.

8.1.3.1 Rule Dropout

It is a well-known fact that neural networks with a large number of parameters are
susceptible to overfitting. MI-ANFIS is no exception, particularly when using a large number
of multiple instance fuzzy rules and relatively small training datasets. In such a scenario,
some rules could co-adapt to (memorize) the training data and degrade the network's ability
to generalize to unseen examples. For situations where overfitting is imminent, we have
proposed a regularization technique, which we call Rule Dropout, and showed that it can
be used to train MI-ANFIS systems with better generalization. Rule Dropout works by
randomly dropping out a few rules (each with a fixed probability 1 - p) before the presentation
of a given training sample. During testing, all rules are used, but the outputs are weighted by
the probability p. Using a Rule Dropout strategy is approximately equivalent to sampling
and training an ensemble of 2^R MI-ANFIS networks (R is the number of rules). As a result, a
more robust generalization can be achieved.
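The Rule Dropout mechanism described above can be sketched as follows. This is a minimal illustration operating on a vector of per-rule outputs; the function name and the way rule outputs are represented are assumptions, and the full MI-ANFIS forward pass (normalization and aggregation layers) is not reproduced here.

```python
import numpy as np

def apply_rule_dropout(rule_outputs, p, training, rng=None):
    """Apply Rule Dropout to a vector holding one output per rule.

    p: probability of keeping a rule (each rule is dropped with probability 1 - p).
    training: if True, sample a random mask; if False, scale all outputs by p.
    """
    rule_outputs = np.asarray(rule_outputs, dtype=float)
    if training:
        rng = np.random.default_rng() if rng is None else rng
        keep = rng.random(rule_outputs.shape) < p  # True with probability p
        return rule_outputs * keep
    # Test time: every rule is used, weighted by p to match the training expectation.
    return rule_outputs * p
```

A new mask would be drawn before each training sample, so that each sample effectively trains a different sub-network of rules.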

8.1.3.2 Multi-Class MI-ANFIS (MCMI-ANFIS)

Initially, MI-ANFIS was proposed and developed for the two-class problem.
We have also presented MCMI-ANFIS, a multiple class MI-ANFIS that can be used
to solve multiple class classification problems effectively. Most MIL methods deal with
this type of problem by using a one-versus-all training pattern. While this extension
is straightforward and works with an arbitrary number of classes, it requires an extensive
amount of preprocessing to relabel the data and generate networks. Moreover, doing so
makes the data unbalanced, so sampling should be used before training. In addition,
some classes may share the same concepts; therefore, training different networks may lead
to redundant rules being learned and to wasted CPU cycles. The proposed MCMI-ANFIS
minimizes a negative log likelihood criterion to learn all classes simultaneously, reducing the
possibility of learning redundant rules.

8.1.4  Validation 

Using synthetic and benchmark datasets, we showed that the proposed Multiple
Instance Fuzzy Inference is comparable to state-of-the-art MI machine learning algorithms.
First, using a synthetic dataset with 150 bags, of which 100 are positive, we showed that our
MI-Mamdani and MI-ANFIS can learn meaningful multiple instance fuzzy rules describing
positive concepts. Next, using five benchmark datasets of different sizes (between
92 bags and 200 bags), namely the MUSK1, MUSK2, FOX, TIGER, and ELEPHANT datasets,
we compared the performance of our framework to 19 other state-of-the-art MIL algorithms.
MI-ANFIS outperformed all other methods on the FOX and ELEPHANT benchmarks, and
otherwise consistently ranked among the top three algorithms. MI-Mamdani performed
better than 10 of the 19 algorithms tested on MUSK1; it also showed better performance
than 7 of the 9 algorithms tested on FOX. However, MI-Mamdani did not
exhibit consistent performance on the rest of the benchmark datasets. Finally, using the
COREL dataset (2000 bags), we applied our proposed MCMI-ANFIS to the problem of
region-based image categorization and showed that our algorithm is competitive
with the state of the art.

Additionally, we have applied our proposed multiple instance fuzzy inference
framework to fuse the output of multiple discrimination algorithms for the purpose of landmine
detection using Ground Penetrating Radar. In this problem, discrimination algorithms
detect target candidates only in two dimensions (down-track and cross-track position). Thus,
there is uncertainty in the depth estimation of the targets that can affect both the training
and testing phases of a fusion system. For training, it is very difficult to localize the objects'
depths automatically, and it is very tedious to do so manually. Moreover, each GPR
alarm is represented as a bag of instances extracted at multiple depths. Only labels for the
bags are available as binary ground truth: target/non-target (positive/negative).
Therefore, we have used our multiple instance fuzzy inference framework to solve this problem
effectively. We have used two different GPR data collections to measure the performance
of our algorithms on this problem. The first collection was used for 10-fold cross
validation, whereas the second collection was used for blind testing. In both testing scenarios,
MI-ANFIS outperformed all other fusion methods that we have proposed, namely the
MI-Mamdani, the standard Mamdani, and the standard ANFIS inference systems.

8.2  Potential  Future  Work 

Although our approach is fully developed and has shown promising results, there is
still room for improvement. For instance, MI-ANFIS uses a fixed hyper-parameter α to
control the behavior of the Softmax function in Layer 3 and Layer 5. In our experiments, we used
α = 1 to replicate the conditions of the standard MIL assumption [36,39]. Future research
may include learning this hyper-parameter online, during training, which may offer more
flexibility for other non-standard applications of MI-ANFIS. Another hyper-parameter that
could be learned is the Rule Dropout rate p. Rule Dropout was deemed important for solving
large problems such as multiple class classification tasks. Learning this hyper-parameter could
improve the overall generalization capability of our system. This task could be achieved
either offline, before training, using cross-validation on a subset of the data, or online during
training.

Future  work  may  also  include  the  evaluation  of  our  framework  on  other  domains 
such  as  computer  audition  [56]  and  text  document  classification  [57].  In  these  applications, 
features  are  extracted  from  audio  segments  or  text  paragraphs,  and  labels  are  only  available 
at  the  audio  clip  level  or  text  document  level,  respectively,  making  them  MIL  problems. 